MmPalateMiRNA, an R package compendium illustrating analysis of miRNA microarray data

Background MicroRNAs (miRNAs) constitute the largest family of noncoding RNAs involved in gene silencing and represent critical regulators of cell and tissue differentiation. Microarray expression profiling of miRNAs is an effective means of acquiring genome-level information of miRNA activation and inhibition, as well as the potential regulatory role that these genes play within a biological system. As with mRNA expression profiling arrays, miRNA microarrays come in a variety of platforms from numerous manufacturers, and there are a multitude of techniques available for reducing and analyzing these data. Results In this paper, we present an analysis of a typical two-color miRNA microarray experiment using publicly available packages from R and Bioconductor, the open-source software project for the analysis of genomic data. Covered topics include visualization, normalization, quality checking, differential expression, cluster analysis, miRNA target identification, and gene set enrichment analysis. Many of these tools carry-over from the analysis of mRNA microarrays, but with some notable differences that require special attention. The paper is presented as a “compendium” which, along with the accompanying R package MmPalateMiRNA, contains all of the experimental data and source code to reproduce the analyses contained in the paper. Conclusions The compendium presented in this paper will provide investigators with an access point for applying the methods available in R and Bioconductor for analysis of their own miRNA array data.

the diseased heart [11] and diseased neural tissue [12]. Thus, microRNA gene expression profiling offers an effective means of acquiring novel and valuable information regarding the expression and regulation of genes, under the control of miRNAs, in a variety of biological systems.
The R software programming language [13] has gained wide popularity among the scientific research community, along with its extension to the realm of genomics applications via the Bioconductor [14,15] software for bioinformatics project. The Bioconductor project contains a variety of R packages for application to high-throughput "omics" data, including array preprocessing and normalization, identification of differentially expressed genes, clustering, classification, gene-set enrichment analysis, and other down-stream analysis methods. Hence, the R packages available at Bioconductor can provide a complete suite of tools for analyzing array data from the initial preprocessing steps through the final determination of interesting genes and gene sets. Several publications have addressed how to perform and reproduce an analysis http://www.scfbm.org/content/8/1 of mRNA expression array data using software from R and Bioconductor [16][17][18]. An integrated way to present the analysis from these experiments is in the form of a compendium [17,18], which encapsulates the primary data, supporting software, statistical analysis, and document text in a manner that allows other investigators to completely reproduce the results of the experiment.
While many of the same tools for analyzing mRNA expression arrays can be applied to the analysis of miRNA data, there are distinct differences between the two platforms which necessitate special use of some methods (see overviews by Sarver [19] and Thomson et al. [20]). In particular, miRNA arrays typically have far fewer genes that are spotted on the array compared to mRNA arrays and require careful consideration of the assumptions behind array preprocessing methods prior to their application. Several recent publications have compared various normalization methods for microRNA microarray data [21][22][23], while others have developed novel methods specifically for miRNA data [24][25][26][27]. Though certain methods were found to outperform others in each case, in general there is still no consensus on the best normalization method. Therefore, investigators are encouraged to perform their own assessments to determine an appropriate normalization method for their data. A second unique aspect of miRNA analysis relative to mRNA analysis is that differentially expressed miRNAs are subsequently evaluated for potential gene targets that are regulated by the miRNAs. A number of databases can be used for this purpopse, and many of these have been ported to R in the form of Bioconductor packages. It is these putative regulatory targets that are typically evaluated for biological and molecular functionality, e.g. by gene set enrichment analysis.
In this article, we illustrate how to analyze a two-color miRNA experiment using available packages from Bioconductor and the Comprehensive R Archive Network (CRAN). Example code is provided for the complete analysis including preprocessing of arrays, normalization, identification of differentially expressed miRNAs, clustering, miRNA target identification, and gene set enrichment analysis. The analysis presented here follows closely to what was presented by Mukhopadhyay et al. [28]. Aspects of miRNA analysis which require special attention are highlighted, as are particular advantages of using specific R and Bioconductor packages. Although the analysis is specific to the Miltenyi Biotech miRXplore platform [29], the general steps outlined here can easily be extended to other platforms as well. To ensure reproducibility of the results, the entire analysis is presented as a compendium [17,18], in the form of an accompanying R package called MmPalateMiRNA [30], which has been made freely available on Bioconductor. The package also includes several functions to produce diagnostic plots for evaluating probe intensity distributions on miRNA microarrays, as discussed in Sarkar et al. [26]. The experimental data used in this manuscript are freely available as part of the compendium package (GEO DataSets [31], accession number GPL10179).

Methods
In the following subsections, we discuss the methodologies used for the analysis of the miRNA data in this compendium. We refer the reader to the original papers for detailed methods, here just providing an overview.

Preprocessing
An important first step in the analysis of microarray data is to check the array quality by inspecting for outliers, spatial artifacts, and for differences in array intensity distributions which may require normalization. Several software packages exist for this purpose; in particular, the arrayQualityMetrics package [32] available from Bioconductor provides a comprehensive report for both one and two-color microarray data. However, the diagnostic plots in that package for two-color arrays are constructed from ratios of the two channels (M values), and for miRNA data plots focused solely on the control / reference channel may be more relevant. Specifically, Sarkar et al. [26] introduced novel diagnostic plots for miRNA data for the purpose of evaluating and comparing different normalization methods, which serve as useful indicators for array quality and outlyling arrays. In addition to evaluating array quality, other important pre-processing steps include identifying outlying values for specific probes, performing non-specific filtering of probes, and imputing probes that are missing or are extreme outliers.

Normalization
Several recent publications have drawn attention to the normalization of miRNA data as distinct from that of mRNA data. In particular, methods that assume some level of symmetry in differential expression, such as loess and quantile normalization, may be inappropriate when global changes associated with phenotypes are present [20]. As such, normalization methods that use a set of invariant probes [23,26], or use single-channel normalization methods [21] may outperform so-called "global" normalization methods. Recent comparisons of normalization methods for miRNA microarray data have resulted in differing conclusions [21][22][23], with top performing methods ranging from quantile normalization for single-channel array data [22] to print-tip loess for two-channel data [21]. However, Sarkar et al. [26] evaluated several different normalization methods, incluuding variance stabilizing normalization (VSN) [33], spike-in VSN, and print-tip loess, and found no statistically significant differences between them based on correlation with http://www.scfbm.org/content/8/1 qRT-PCR measurements. As is typical with array data in general, investigators are encouraged to try several different normalization methods and evaluate the differences betweeen them on the basis of diagnostic plots [26].

Differential expression
A variety of methods exist to determine differential expression between two or more groups of expression data, including the classic t-test and the more recent 'moderated' variants. Members of the latter category include the Significant Analysis of Microarrays (SAM) [34], and empirical Bayes methods [35,36]. In particular, the methodology developed by Smyth [36] extends these concepts to apply to general microarray experiments with arbitrary numbers of treatments and samples, in the context of a hierarchical linear model. A model is fitted to the expression values for each gene/transcript, and used to evaluate differential expression for contrasts (comparisons between treatment groups) of interest. A 'shrinkage' estimate of the variability is obtained by a weighted average of the a pooled estimate of variation and the per-gene estimate of variation. This lessens the occurrence of large t-statistics due to exceptionally small variance estimates, and effectively introduces a "fold-change" criterion into the statistic. The methods are available in the Bioconductor package limma [37].

Clustering
Clustering of array profiles is helpful for determining underlying structure in the changes of gene expression, especially for time course data. Common methods include hierarichical clustering, divisive hierarchical clustering (DIANA), K-means, self-organizing maps (SOM), the selforganizing tree algorithm (SOTA), partitioning around medoids (PAM), and model-based clustering [38][39][40]. With the diversity of methods available for the investigator to try, a commonly encountered difficulty is determining which clustering algorithm to use for a particular data set. This problem can be partially overcome using clustering validation measures, as found in the clValid package [41]. The clValid package allows the user to select from among ten different clustering algorithms and uses three different sets of validation measures (internal, stability, and biological) to evaluate the performance of each algorithm for a range of cluster numbers.

Identification of miRNA target genes
After a subset of miRNAs of interest has been determined, e.g by differential expression or clustering, the next step is to determine the potential regulatory targets of the miRNA molecules. Algorithms for predicting miRNA target molecules are fundamentally based on sequence complimentarity (between the mature miRNA transcript and the 3'-untranslated regions of potential target mRNAs), species conservation, thermodynamic stability, and site accessibility (see Alex et al. [42] for an overview). The Bioconductor package RmiR.Hs.miRNA [43] contains six databases for human miRNA targets, while the database of targets in miRBase [44] is available through the Bioconductor packages mirbase.db [45] and microRNA [46]. The TargetScan database of miRNA targets [47] is also available in targetscan.Hs.eg.db [48] for humans and targetscan.Mm.eg.db [49] for mouse.

Gene set analysis
Once putative regulatory targets of the differentially expressed miRNAs have been identified, a logical next step is to identify what biological or functional pathways the targets have in common with each other. This can be accomplished by gene set analysis, or gene set enrichment analysis [50]. The regulatory targets are compared with predefined gene sets such as GO classifications [51], KEGG pathways [52], chromosome bands, and protein complexes. Gene set analysis is based on the hypergeometric test and identifies which biological gene sets have an under-and over-representation of the identified miRNA targets. Bioconductor packages which provide gene set analysis include GOstats [53] and Category [54].

Results and discussion
Preliminaries R packages that are needed for running the example code in this manuscript are MmPalateMiRNA [30] and its dependencies, and the additional packages latticeExtra [55], clValid [41], targetscan.Mm.eg.db [49], microRNA [46], org.Mm.eg.db [56], and GOstats [53]. The full list of dependencies is given in the Availability and requirements. To begin, we download and install all of the needed packages for running the code in this compendium. In the following, text after the R> prompt denotes an R command, and a "+" denotes a continuation in code. The R code from this compendium is available as Additional file 1 ("MmPalateMiRNA SCBM.R").

miRNA data
The microRNA microarray data in this compendium were obtained as previously described in Mukhopadhyay et al. [28], and the data are publicly available from GEO [31] (accession number GPL10179). Briefly, mouse embryonic tissue was obtained on gestational days (GD) 12, 13, and 14, which represents the critical period of palate development in the mouse. Total RNA (containing miRNAs) was isolated using standard RNA extraction protocols. RNA samples (1 μg) isolated from mouse embryonic orofacial tissues (GD-12 -GD-14) as well as the miRXplore Universal Reference (UR, control channel) were fluorescently labeled with Hy5 (red) or Hy3 (green), respectively, and hybridized to Miltenyi Biotec miRXplore Microarrays using the a-Hyb Hybridization Station [29]. For each gestational day, three distinct pools of RNA were independently processed and applied to microarray chips. Probes for a total of 1336 mature miRNAs (from human, mouse, rat and virus), including positive control and calibration probes, were spotted in quadruplicate on each microarray. Each array included probes for 588 murine miRNAs. The miRXplore Universal Reference (UR) controls, provided by Miltenyi, represent a defined pool of synthetic miRNAs for comparison of multiple samples. Fluorescence signals of the hybridized miRXplore Microarrays were detected using a laser scanner from Agilent Technologies. Mean and median signal and local background intensities for the Hy3 and Hy5 channels were obtained for each probe on each of the nine microarray images using the ImaGene software [60]. The experimental data is included in the MmPalateMiRNA package in a compiled format, as an RGList object (a class in package limma [37] for twocolor microarray data) called PalateData. The data is loaded into the R session using the code below. To see how PalateData was created from the source data files, see Additional file 2 ("ReadingTwoColorData.pdf ") and the corresponding R code in Additional file 3 ("Read-ingTwoColorData.R"). For more information on the data in PalateData, use ?PalateData or see Additional file 2.

Outlying arrays
Sarkar et al. [26] described several diagnostic plots for miRNA data that can be used to evaluate the need and effectiveness of normalization procedures. These plots can also serve as aids to determine outlying arrays and batch effects. One such plot is the kernel density estimate for each array, for different types of probes. Figure 1 plots the density estimates of the log 2 intensity values in the control channel for the unnormalized data, separated into "MMU miRNAs" (MMU = Mus musculus, i.e. mouse), "Other miRNAs", and "Control" probes (other probes were non-informative). The plot requires use of the lattice package, and the MmPalateMiRNA package contains methods to produce plots for RGList objects based on the generic functions in lattice. The code below illustrates the use of the function densityplot to produce Figure 1. To access the documentation file for this function, use ?densityplot (in general, the documentation file for function fun is accessed through ?fun, and the documentation file for S4 class obj is accessed through class?obj).

Outlying values
In addition to checking for outlying arrays, it is important to check for outlying values on individual probes. To accomplish this, we evaluated for each probe whether there were any extreme values (greater than 2.665 standard deviations above the mean). The checkOutliers function checks this for each of the red and green channels in an RGList object and returns the indices of array probes with extreme values.

MMU miRNAs
Other miRNAs

Figure 2
Distance between arrays prior to normalization. Distance between arrays in the PalateData miRNA data from the MmPalateMiRNA package. Distance was based on the median of the absolute differences in unnormalized log 2 intensity values of the reference channel. Separate panels are provided for "MMU miRNAs", "Other miRNAs", "Control", and "Empty" probes. Arrays have been reordered so that the outlying arrays (12-1, 13-2, and 14-3) are grouped together.
R> boxplot(as.data.frame(t(PalateData + $R[outliers$Rout, ]))) The figure is omitted but clearly shows that the identified outlying values are nearly two orders of magnitude above the rest of the intensity values. Rather than omitting these values, we exploit the replicated design of the arrays and substitute the mean of the other replicates on the array for the extreme values using the fixOutliers function.

Missing values
In addition to checking for outlying values, we also check for any missing values in the two channels using the checkMVs function. Here, we only find two probes on the array with missing values in the background channels, so we again impute these values using the means of the backgrounds from the other three replicates on the chip using the fixMVs function.

Filtering probes
Prior to running the normalization methods, we filter the probes and keep only those which correspond to miR-NAs and calibration probes. Additionally, probes that are not sufficiently above the background intensity level may be unreliable and represent noise that can interfere http://www.scfbm.org/content/8/1 with subsequent analysis, including normalization [26]. Prefiltering also reduces the number of statistical comparisons being performed and improves overall power [61]. Here, we filter probes whose foreground intensity values are below 1.1 times their background intensity level. To allow for probes which may be expressed for a particular experimental condition (here, gestational day), we keep all probes which have at least 3 samples above the filtering threshold. Lastly, only those genes with all four replicates passing the filtering step are retained. After all pre-processing steps, a total of 956 probes, corresponding to 175 mouse miRNAs, 42 other miRNAs, and 22 calibration probes each replicated 4 times, remain.

Normalization
Based on the literature [21][22][23]26], we evaluated several normalization procedures on the filtered data, including none, median, loess, quantile, VSN, and spikein VSN. The limma package [62] includes various options for both within (normalizeWithinArrays) and between (normalizeBetweenArrays) array normalization, and the vsn package [33] has functions for performing VSN and spike-in VSN. In all cases, a simple background correction was performed by subtracting background from the foreground intensities.

Diagnostic plots
Several diagnostic plots can be used to contrast the effectiveness of each normalization procedure. The MmPalateMiRNA package contains several methods to produce these plots for lists of class MAList or NChannelSet objects, based on functions in the lattice package. Figure 3, rows one through five, plots the intensity distribution for the reference channels after each of the normalization procedures (use of the useOuterStrips function requires the latticeExtra package). Note that the order of panels in lattice plots is from the bottom left panel to the right and up, according to the rules used for graphs. The quantile normalization procedure is clearly the most successful in removing the intensity bias that was apparent for three of the arrays (12-1, 13-2, and 14-3), while loess and median normalization appear to be the least successful. Notably, normalization based on the spike-in probes was unsuccessful, perhaps since these probes were shifted differently in the reference channel relative to the other probe types.
R> res <-levelplot(ndata.all, + channel="G", + order=c(1,5,9,2:4,6:8), + scales = list(rot=c (45,45))) R> print(res) http://www.scfbm.org/content/8/1 To investigate the effect of the normalization procedure on the experimental channel, plots of the spread (median absolute deviation) versus the location (median) of all probes can be used. Plots of this type can be produced using the MADvsMedianPlot function in the MmPalateMiRNA package. Probes of different types are highlighted, with particular focus on the spike-in probes, which should have low variability across all the arrays. In Figure 5, spike-in VSN has the lowest variability among the spike-in probes, compared to the other normalization methods. However, spike-in VSN has also dramatically decreased the variation among all the probes in the experimental channel, making the normalization procedure questionable in this case. Quantile normalization has resulted in large variations for some of the probes with lower intensity values.
R> res <-MADvsMedianPlot(ndata.all, + channel = "R", group = "probe.type", + pch = 20, subset = c("MMU + miRNAs", "Other miRNAs", + "Control"), + key = list(points = list(pch = + rep(20, 3), + col = trellis.par.get() + $superpose.symbol$col[1:3]), + text = list(c("MMU miRNAs", + "Other miRNAs", "Control")), + columns = 3)) R> print(res) Plots of the log 2 intensity ratios (M values) versus the mean log 2 intensity values (A values) for each probe can be used to evaluate whether there is a bias associated with overall intenstity level for each array. This so-called "MA" plot is illustrated in Figure 6 for quantile normalization. MA plots for the other normalization methods are not shown, though code to produce the plots is available in http://www.scfbm.org/content/8/1 Median of absolute differences in log2 expression Distance between arrays in reference channel after normalization. Distance between arrays in the PalateData miRNA data from the MmPalateMiRNA package, both before ("None") and after normalization by various normalization procedures. Distance was based on the median of the absolute differences in log 2 intensity values of the reference channel, for all probes remaining after filtering. Arrays have been reordered so that the outlying arrays (12-1, 13-2, and 14-3) are grouped together.
the accompanying R script "MmPalateMiRNA SCBM.R". Quantile normalization has removed any association between the M and A values, while for VSN normalization there is still a trend which is similar to the unnormalized data. The MA plot for spike-in VSN shows a dramatic effect on the intensity ratios.
R> res <-MAplot(ndata.quantile, pch = + 20) R> print(res) As a final evaluation, we inspected heatmaps along with hierarchical clustering of the arrays. Figure 7 displays the heatmap after quantile normalization and reveals that the previously identified outlying arrays (samples 12-1, 13-2, and 14-3) still do not cluster with the other replicates for that day.
R> heatmap(ndata.quantile$M, col = + cm.colors(256), labRow = FALSE) Table 1 gives the correlations between each pair of arrays, based on the log 2 intensity ratios. Since the other two replicates for each day were highly correlated (r ≥ 0.95), we decided to use only those two replicates from each day for subsequent statistical analysis. Normalization was redone omitting the arrays 12-1, 13-2, and 14-3, using quantile normalization.

Imputation
Sixteen probes from the six arrays exhibited negative intensities after the background procedure and resulted in missing values for subsequent calculation of the log 2 intensity ratios. A significant percentage of missing values can have a substantial impact on downstream analysis of array data [63], and in such cases choice of a imputation procedure should be carefully considered. Here, with a relatlively small percentage of missing values, the impact on data analysis will be relatively minimal. Hence we use the K-nearest neighbor imputation scheme [64] as a fast and effective approach, implemented in the imputeKNN function included in package MmPalateMiRNA. R> ndata$A <-imputeKNN(as.matrix + (ndata$A), dist = "cor")

Determining differentially expressed miRNAs
To test for differential expression of miRNAs between different gestational days (GD-12, 13, and 14), the limma package [36,37] was used. Use of the limma package requires the user to create a design matrix, which defines the possible levels for each experimental factor, and is used to construct a model matrix and contrasts to test for differential expression between factor levels. The model matrix consists of indicator variables for the levels of each experimental factor in our design, which in our case corresponds to each of the gestational days.
R> design <-data.frame(grp = c(1, 1, + 2, 2, 3, 3), rep = c(1, 2, + 1, 2, 1, 2)) R> design$grp <-factor(design$grp, + labels = c("Day12", "Day13", + "Day14")) R> mmat <-model.matrix( ∼ 0 + + design$grp) R> colnames(mmat) <-c("Day12", + "Day13", "Day14") Estimates of gene expression are based on the log 2 Red/Green intensity ratios, hereafter referred to as 'expression values' . Contrasts defined here estimate the differences in mean expression between each gestational day. The makeContrasts function in limma will generate these for you. Some advantages of using limma over other methods include the ability to incorporate probe quality weights and to handle duplicate probes for each miRNA on the chip via the duplicateCorrelation function [62]. These advantages are particularly evident in small sample sizes, as in this experiment. To make use of the duplicated probes, we first order the normalized data so that replicated probes are adjacent to each other. The probe quality weights are incorporated in the calculation of the correlation matrix for the duplicated probes.
R> fit <-lmFit(ndata, mmat, ndups=4, + correlation=corfit$consensus) R> fitc <-contrasts.fit(fit, + contrast.matrix) R> fitc <-eBayes(fitc) The topTable function calculates and reports fold change, moderated t-statistics, unadjusted and adjusted p-values for the comparison of interest. P-values are adjusted by the method of Benjamini & Hochberg [65], which controls the expected false discovery rate. Code below shows the calculation for the comparison between gestational days 13 and 12, and the results are given in Table 2  gestational days are omitted but code to calculate them is included in the R script "MmPalateMiRNA SCBM.R".

R> print(res, include.rownames = FALSE)
A nice summary of the results for the comparisons between gestational days is a Venn diagram, which gives the number of up-and down-regulated genes for each comparison, along with the number in the intersection of these sets (Figure 8).

Clustering expression profiles
After identifying the differentially expressed miRNAs, clustering analysis can be performed to group genes with similar trends over time. A common difficultly is deciding which clustering algorithm to use and how many clusters to create. Cluster validation measures, as contained in the R package clValid [41], can help in this regard. Below, the clValid function is used to evaluate hierarchical clustering, SOTA, DIANA, and K-means clustering algorithms, for a range of one to six clusters in each case. The expression values for each day are averaged over the two replicates prior to clustering (object aveExpr). The internal validation measures (connectivity, Dunn Index, and Silhouette Width) are used with a correlation metric. A summary of the result indicates that hierarchical clustering with six clusters provides the optimal connectivity and Dunn Index measures, while DIANA with six clusters gives the optimal Silhouette Width.

Determining miRNA target genes
To follow-up the results from the differentially expression and clustering analysis, the next step is to determine putative regulatory targets of the differentially expressed miRNAs. To illustrate, we identify the putative targets of the miRNAs contained in the first cluster in Figure 9. The miRNAs in the first cluster are http://www.scfbm.org/content/8/1 evaluated for putative targets using the databases Tar-getScan [47] (package targetscan.Mm.eg.db [49]) and miRBase [44] (package microRNA [46]). The mouse specific miRNA names are first extracted and then converted to the standard nomenclature using the function miRNames, which is included in the accompanying R script.
R> ids1 <-names(clusters[which(clusters + == 1)]) R> miRs1 <-miRNames(ids1, + avedata$genes$Name, avedata$genes + $"Gene ID") Targetscan targets are obtained using the code below. The objects in the targetscan.Mm.eg.db package are Bimap objects, which are mappings from one set of keys (the left keys or Lkeys) to another (the right keys or Rkeys). We start by mapping the miRBase identifiers to their miRNA family names, then map the miRNA families to Entrez Gene identifiers of the targets in the TargetScan database. Several of the miRNAs of interest required slight modifications to their names prior to their mapping. The TargetScan database identifies 4,640 unique Entrez Gene identifiers as putative targets.

Gene set analysis
As a final step in our analysis, we take the putative miRNA targets from the intersection of the Tar-getScan and miRBase databases and perform gene set enrichment analysis on them, using the hypergeometric test from the GOstats package [66]. Terms in the GO hierarchy are analyzed for over-representation of genes from our miRNA target list, relative to the total number from the mouse genome having that annotation. A GOHyperGParams object is created which contains the list of targets (selectedEntrezIds), the gene "universe" (entrezUniverse), the annotation database to use, the GO ontology, and direction and significance level of the test.
R> hgOver <-hyperGTest(GOparams) R> htmlReport(hgOver, file = + "hgResult.html") As a final step, we evaluate the mature miRNA sequences and seed regions of the miRNAs which target the genes in a particular GO category. To illustrate, the GO category 0007179, transforming growth factor beta receptor signaling pathway, is used. Entrez Gene IDs belonging to this category are identified and intersected with the selected Entrez Gene IDs corresponding to cluster one of Figure 9. This results in 21 identified Entrez Gene IDs.

1] 21
Next, these Entrez Gene IDs are reverse mapped back to the set of miRNAs which putatively target these genes. This produces a total of 19 identified miRNAs.

Session information
It is important to note that some of the presented results may depend on the versions of the software packages that were used to produce them. The following gives the complete information of the R session under which the presented results were obtained.

Conclusions
In this paper, we present a complete analysis of miRNA data using R and Bioconductor, including preprocessing, normalization, differential expression, clustering, identification of target genes, and gene set enrichment analysis of putative miRNA gene targets. Though there are several papers in the literature which give an overview of the analysis of miRNA data, the MmPalateMiRNA package is unique in presenting a comprehensive analysis of miRNA data which is completely reproducible. Further, while the number of packages for analyzing miRNA array data in Bioconductor is continuing to expand (see, e.g., packages LVSmiRNA [68], miRNApath [69], RmiR [70], and ExiMiR [71]), the distinguishing characteristic of http://www.scfbm.org/content/8/1 this package is that it integrates many of these recent advances into one central document. Thus, this article can serve as a template for other investigators to conduct their own analysis. Important aspects of selecting a normalization algorithm for miRNA data are illustrated, along with code for producing useful diagnostic plots to select an appropriate procedure [26]. These functions are not readily accessible to users other than through the MmPalateMiRNA package. Advantages of using the limma package to fit advanced hierarchical models for testing differential expression are documented, along with code for testing comparisons between experimental groups of interest. Lastly, we illustrate the use of miRNA target databases which have been recently ported to Bioconductor for identifying putative gene targets of selected miRNAs, as well as how to test for enrichment in biological and functional categories among the putative miRNA targets. While the analysis we present here is fairly comprehensive, it is straightforward to use other software, such as Ingenuity Pathway Analysis [72] to build from the results presented in this article (see [28] as an example). The complete analysis in this article is freely available as a compendium in the form of an R package (MmPalateMiRNA, downloadable from Bioconductor [15]), along with accompanying documentation, code, and functions to perform all of the analysis.