The Shivplot: a graphical display for trend elucidation and exploratory analysis of microarray data
© Woody and Nadon; licensee BioMed Central Ltd. 2006
Received: 30 May 2006
Accepted: 08 November 2006
Published: 08 November 2006
High-throughput systems are powerful tools for the life science research community. The complexity and volume of data from these systems, however, demand special treatment. Graphical tools are needed to evaluate many aspects of the data throughout the analysis process because plots can provide quality assessments for thousands of values simultaneously. The utility of a plot, in turn, is contingent on both its interpretability and its efficiency.
The shivplot, a graphical technique motivated by microarrays but applicable to any replicated high-throughput data set, is described. The plot capitalizes on the strengths of three well-established plotting graphics – a boxplot, a distribution density plot, and a variability vs intensity plot – by effectively combining them into a single representation.
The utility of the new display is illustrated with microarray data sets. The proposed graph, retaining all the information of its precursors, conserves space and minimizes redundancy, but also highlights features of the data that would be difficult to appreciate from the individual display components. We recommend the use of the shivplot both for exploratory data analysis and for the communication of experimental data in publications.
Microarrays  have provided a wealth of gene expression data for the biological community to interpret. The technology presents a snapshot of cellular transcription at an unprecedented level of detail, with certain array designs containing probes to assess expression levels of every known gene within the target organism's genome. Microarray data have been presented in an ever-growing number of publications, and confidence in the technology is growing as progress is made validating microarray findings.
The analysis of microarray data remains challenging, however, for a number of reasons. The technology is expensive, often restricting the number of replications that can be run in an experiment. Moreover, the validity of various statistical models and their corresponding processing algorithms for gene expression data are being actively debated [2, 3].
One of the more daunting but unavoidable aspects of microarrays, as with all high-throughput systems, is the sheer volume of data that must be examined. Researchers are unable to develop comprehensive familiarity with each of the millions of data points available in a typical microarray data set. Multi-display graphical methods are thus key for reaching an understanding of the data, for assessing analysis assumptions, and for examining the effects of data pre-processing methods.
Plotting allows the rapid and concise representation of the data as a whole, and allows trends and local aberrations to be spotted with relative ease . These trends are useful for more than just diagnostic assessments; due to the relative paucity of replicate observations in many studies, supplementary information is frequently gathered from observations with similar expression properties. This information can be used, for example, to obtain an improved estimate of inter-array variability .
A further motivation for concise and elegant graphics is the space restrictions typically imposed by scientific journals. A thorough and well designed graphic needs to be explained just once, and its subsequent re-use can allow the reader rapid insight into volumes of visual data.
The construction of comprehendible graphics, however, is a challenge of its own. Unnecessary ink must be kept to a minimum, and it is easy to obscure or even mask important points through graphs that are complex or confusing. Efficiency, readability, and relevance are of paramount concern.
With these criteria in mind, the present work illustrates an aggregate graphical approach, inspired by (but not restricted to use with) microarray data. The proposed technique pools the strengths of several other graphical methods which are commonly applied to microarray data: a boxplot, a probability density function plot, and a plot illustrating variability as a function of signal intensity. The present work demonstrates that these plots can be productively combined, facilitating integration of complementary information. Although this plot design is motivated by microarray analysis, we present this technique as a viable tool for both exploration and publication of other types of high volume multivariate data sets.
This section reviews the component plots that are assembled in our proposed graphical method. Simulated and empirical microarray data are used to illustrate the strengths and weaknesses of these respective plots when presented individually. The microarray data used in this section were obtained from the publicly available Affymetrix GeneChip spike-in data .
The five-point summary delivered in the box-and-whisker portions of boxplots allows rapid access to many aspects of the distribution. For example, indications of skewness can be seen both in the position of the median within the interquartile range box (closer to either end implicates skewness) and the lengths of the respective whiskers extending from the box (if one whisker is much longer than the other, skewness is a possible explanation).
Boxplots, however, have shortcomings. Although they are easily interpreted and are pleasing to the eye, they have poor data-ink ratio . The "box" component of a boxplot can be disassembled into its 5 composite values, with the remaining ink merely providing assistance to the eye. For example, only one half of the boxplot, as divided by the horizontal axis of bilateral symmetry through the center of the boxplot, is necessary.
Furthermore, enclosing a "boxed" region encourages the interpretation of area, a quantity which is arbitrarily determined by the boxplot's vertical width. This two-dimensional depiction only distracts from the interpretation of one-dimensional distance. Although the marked quantities and their positions relative to one another are relevant, a majority of the lines surrounding and joining these values are not. Boxplots also potentially mask underlying features of the distribution, such as tail densities and multiple modes . As an illustration of this hazard, compare the boxplot generated from a normal distribution (Figure 1a) to the boxplot of a distinctly bimodal (mixture of two non-overlapping normals) distribution (Figure 1b). From these two illustrations alone, it would be very difficult to discern that the two distributions have a different number of modes. This information can be recovered, however, through use of the density plot.
Perhaps the easiest way to introduce the density plot is to first examine the related concept of a histogram . Histograms work on a univariate data set by first stratifying the data into "bins" based on value, and then plotting the population count of each bin. The resulting graph displays how the data values are distributed; bins containing many observations have taller bars than sparsely populated bins. Histograms are excellent for detecting multiple modes, skewness, and kurtosis. Examples of histograms can be seen in Figures 1d–f.
Density plots  offer what is in essence a "smoothed" histogram; instead of using discrete bins and counts, density plots employ a continuous curve to communicate the same information. Area is used to convey the probability of observations within specified ranges. Specifically, were a sample value drawn from the depicted distribution, the probability of this value lying within a given interval can be determined by taking the integral of the curve bounded by that interval. Example density plots can be seen in Figures 1g–i. Like the histogram, density plots are excellent for detecting multiple modes, skewness, and kurtosis. The limitation of any approach that employs frequency as a metric is that extreme values and outliers are given little credence. These outlying values can carry a high leverage on important quantities such as the mean, yet they have only a subtle effect on the density curve. Although density plots do an excellent job estimating the position and number of modes, the mode is itself a highly variable estimator of the center of a distribution, and it is challenging to visually estimate the position of a mean or median from the density curve.
It is typical for the variability of replicated microarray data to exhibit a dependence on intensity, whether replication is across or within arrays and whether the variability reflects processing effects only or processing plus biological effects. One common approach is to log gene expression data, which tends to stabilize error variance across replicates for mid-to-upper range intensity values but which has the disadvantage of inflating additive error for low intensity values. Alternative approaches to microarray data transformation model both additive and proportional error components (see the "generalized log" methods of [12–16]). Variability versus intensity plots can be used to assess how successfully such transformations have stabilized the variance throughout the entire intensity range. Additionally, the plot can also be used as a visual aid for pooling estimates of random error associated with genes of similar expression intensity [3, 17, 18].
Results and discussion
As can be seen from Figure 2a–c, all three of the discussed plots – boxplot, density plot, and variability-versus-intensity (standard deviation vs mean) – have an axis in common. Figure 2d presents the shivplot, an amalgamation of the three plots. To first capture the essence of the boxplot, we eliminate its vertical width, thereby reducing it down to point components: the five distributional landmarks along with the outliers. These points can be displayed as ticks across the middle of the graph, at y = 0. Then, in the region below, we can draw the reflection of the density function plot (with the scale starting at zero and increasing downward along the lower half of the y-axis). The area enclosed by this curve below y = 0 represents the probability density. Lastly, we can draw the loess fit from the variability-versus-intensity plot in the upper half of the graph. One can quickly isolate the information in any of the three precursor graphs from the final image – there is no significant loss of information when the plots are superimposed. (The values along the right side of the graph provide maximum values: the maximum standard deviation above, and the maximum density below.)
The data interpreter needs to adjust to two aspects of this graph: The upper and lower halves of the plot have distinct y-axes, with different scales and units, and the density distribution is inverted. However, having the three graphs simultaneously available provides the advantage that each plot can readily be interpreted in the context of the other two. The advantages of this feature are best illustrated by examining empirical data, as provided in the following section.
There is an ongoing debate about the appropriate analysis of data derived from microarrays. As such, it is of great interest to biologists and statisticians alike to observe the impact of different statistical algorithms on microarray data. Having approachable visual tools with which to make such comparisons greatly expedites this tedious and iterative process.
The examples in this section come from three different data sets. To build familiarity with the shivplot, we use data from the Affymetrix U133 spike-in data set to demonstrate the appearance of the components of the shivplot, and then follow this up with their shivplot equivalent. From these shivplots, we will be able to evaluate distributional features that distinguish the expression estimates produced by different Affymetrix normalization procedures from one another.
Beyond distinguishing the nuances of algorithms, the shivplot can also be used to detect interesting distributional properties of data sets that typical exploratory graphics may understate or miss. To demonstrate this utility, we have obtained the data from two microarray studies that were specifically designed to examine technical aspects of microarray data structure. Specifically, data sets produced by  and  will be examined, as these data sets highlight unique advantages of the shivplot in an exploratory context.
Affymetrix spike-in data
As mentioned earlier, the microarray manufacturing company Affymetrix has produced spike-in data sets where the underlying transcript quantity information is engineered and thus known in advance. Such data sets have been used to design , field test , and compare  statistical algorithms. The specific details of these spike-in experiments can be found at the aforementioned website. We shall use our shivplot to compare a subset of Affymetrix analysis algorithms.
MAS 5.0 has a step in its algorithm in which unrealistically small gene expression values are replaced with imputed values. The consequences of this approach can be seen uniquely in the MAS 5.0 shivplot as a decrease in variability across replicates at the (partially imputed) low end. A similar dip in average variability can be see at the low end of the RMA intensity spectrum, although this phenomenon does not have a satisfying methodological explanation like the one offered for MAS 5.0. The comparative plot also shows the substantially lower replicate variability of RMA relative to MAS 5.0 and dChip, a feature of the RMA algorithm that has been well documented . Though dChip has relatively stable variance at the high end, there is an escalation in variability near the low end.
Kendziorski et al. pooling data
Choe et al. Golden Spike data
The final data set we examine here was the Golden Spike experiment produced for . This data set was engineered so that a vast majority of the effect sizes between treatment and control condition were known in advance, as determined by a predetermined spike-in design. Although designed to serve as a benchmarking data set for algorithm comparison, the golden spike data set produced results at odds with other benchmark data sets and has been critiqued on methodological grounds . More specifically, it was remarkable that in the comparison performed in , the standard RMA algorithm performed quite poorly, and in particular worse than MAS 5.0. This finding contrasts with results of other method comparison studies. In an attempt to identify a possible explanation for this unusual result, we apply the shivplot successively to the RMA and the MAS 5.0 processing of the Golden Spike expression data.
On one hand, Figure 8 presents a multi-sample shivplot of RMA normalized data, distinguished by group and by whether normalization was performed across conditions. It is noteworthy that the variance vs intensity relationship is quite different across conditions, even after executing the very strong quantile normalization procedure, integral to RMA, which largely homogenizes the expression measure distributions. This effect persists regardless of whether the conditions are normalized together or separately. As such, quantile normalization did not produce the variability discrepancy. An explanation of this discrepancy would be of relevance to anyone using variability to weight the significance of differential expression.
As in the analysis of the previous data set, the shivplot's simultaneous presentation of the variability-vs-intensity plot and the density distribution is clearly advantageous. Had the variability plot been examined individually, it would have been challenging to correctly predict the homogeneity of the component distributions. Similarly, it would have been difficult to predict from the highly similar distributional plots that the variability relationship would differ so markedly. The findings in this plot offer a graphical explanation why the (unusual) second round of normalization was deemed necessary by  to analyze their data.
Figure 9, on the other hand, which presents the multi-sample shivplot of the MAS 5.0 normalized data, tells a different story. In contrast to the crossing trend lines observed in Figure 8, the variance-intensity relationships for the control and treatment groups were highly similar, except for a small number of low intensity measurements.
It seems likely that either RMA has produced this unusual difference between groups (although, as we have demonstrated, the quantile normalization is unlikely to blame), or some aspect of the MAS 5.0 normalization procedure has managed to eliminate it from the raw data. While further detective work is beyond the scope of the present work, the shivplot was a pivotal tool in uncovering this intriguing phenomenon in the Golden Spike data.
R/S-PLUS code written by the authors to produce shivplots is available in a flat file [see Additional File 1]. Also provided are a tutorial demonstrating the use of the code [see Additional File 2] and example data sets formatted for use with the code [see Additional Files 3, 4]. Alternatively, the entire shivplot library is available as a package for R in either source form [see Additional File 5] or as a Windows binary [see Additional File 6].
Microarray data, like data obtained from all high-throughput assays, represent a complex system, and significant amounts of space must be allocated in each publication to introduce and familiarize the reader with the technology and methodology used in any analyses. With the space restrictions imposed by many scientific journals, it is important to present as much information as cleanly as possible in concise graphics. The present work illustrates how three graphical tools of growing importance to microarrays and other high-throughput platforms can be effectively combined into a single plot, not only saving space but facilitating insights obtainable from complementary simultaneous interpretation. Furthermore, the introduced tool is general enough to serve a variety of high-throughput purposes, both in microarray data analysis and exploratory data analysis in general.
The authors thank Carl Murie for his helpful comments on the manuscript. This work was supported by Genome Quebec Phase I Bioinformatics Consortium funding (High-Throughput Gene Expression) and Genome Canada/Genome Quebec funding (Microarray Data Analysis for Class Comparison: Methods, Software and Pedagogy).
- Schena M, Shalon D, Davis RW, Brown PO: Quantitative Monitoring of Gene-Expression Patterns with a Complementary-DNA Microarray. Science. 1995, 270: 467-470. 10.1126/science.270.5235.467.View ArticlePubMedGoogle Scholar
- Craig BA, Black MA, Doerge RW: Gene expression data: The technology and statistical analysis. Journal of Agricultural Biological and Environmental Statistics. 2003, 8: 1-28. 10.1198/1085711031256.View ArticleGoogle Scholar
- Nadon R, Shoemaker J: Statistical issues with microarrays: Processing and analysis. Trends in Genetics. 2002, 18: 265-271. 10.1016/S0168-9525(02)02665-3.View ArticlePubMedGoogle Scholar
- Cleveland WS: Visualizing data. 1993, Murray Hill NJ:, Summit NJ: At&T Bell Laboratories ; Published by Hobart PressGoogle Scholar
- Allison DB, Cui X, Page GP, Sabripour M: Microarray data analysis: from disarray to consolidation and consensus. Nature Reviews Genetics. 2006, 7: 55-65. 10.1038/nrg1749.View ArticlePubMedGoogle Scholar
- Affymetrix – Latin Square Data. [http://www.affymetrix.com/support/technical/sample_data/datasets.affx]
- Hoaglin DC, Mosteller F, Tukey JW: Understanding robust and exploratory data analysis. 1983, New York: WileyGoogle Scholar
- Tufte ER: The visual display of quantitative information. 2001, Cheshire, CT: Graphics PressGoogle Scholar
- Esty W, Banfield J: The Box-Percentile Plot. Journal of Statistical Software. 2003, 8.17: 1-14.Google Scholar
- Venables WN, Ripley BD: Modern applied statistics with S-PLUS. 1999, Statistics and computing, New York: Springer-Verlag, 3View ArticleGoogle Scholar
- Silverman BW: Density estimation for statistics and data analysis. 1986, London ; New York: Chapman and HallView ArticleGoogle Scholar
- Durbin B, Hardin J, Hawkins D, Rocke DM: A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics. 2002, 18: S105-S110.View ArticlePubMedGoogle Scholar
- Durbin B, Rocke DM: Estimation of transformation parameters for microarray data. Bioinformatics. 2003, 19: 1360-1367. 10.1093/bioinformatics/btg178.View ArticlePubMedGoogle Scholar
- Geller SC, Gregg JP, Hagerman P, Rocke DM: Transformation and normalization of oligonucleotide microarray data. Bioinformatics. 2003, 19: 1817-1823. 10.1093/bioinformatics/btg245.View ArticlePubMedGoogle Scholar
- Huber W, von Heydebreck A, Sultmann H, Poustka A, Vingron M: Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics. 2002, 1: 1-9.Google Scholar
- Rocke DM: Design and analysis of experiments with high throughput biological assay data. Seminars in Cell & Developmental Biology. 2004, 15: 703-713. 10.1016/j.semcdb.2004.09.007.View ArticleGoogle Scholar
- Baldi P, Long AD: A Bayesian framework for the analysis of microarray expression data: Regularized t-test and statistical inferences of gene changes. Bioinformatics. 2001, 17: 509-519. 10.1093/bioinformatics/17.6.509.View ArticlePubMedGoogle Scholar
- Jain N, Thatte J, Braciale T, Ley K, O'Connell M, Lee J: Local pooled error test for identifying differentially expressed genes with a small number of replicated microarrays. Bioinformatics. 2003, 19: 1945-1951. 10.1093/bioinformatics/btg264.View ArticlePubMedGoogle Scholar
- Cleveland WS, Devlin SJ: Locally Weighted Regression – an Approach to Regression-Analysis by Local Fitting. Journal of the American Statistical Association. 1988, 83: 596-610. 10.2307/2289282.View ArticleGoogle Scholar
- Kendziorski C, Irizarry RA, Chen KS, Haag JD, Gould MN: On the utility of pooling biological samples in microarray experiments. Proc Natl Acad Sci U S A. 2005, 102 (12): 4252-7. 10.1073/pnas.0500607102.PubMed CentralView ArticlePubMedGoogle Scholar
- Choe SE, Boutros M, Michelson AMCGM, Halfon MS: Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset. Genome Biology. 2005, 6:Google Scholar
- Affymetrix: Microarray Suite User Guide, Version 5. 2001Google Scholar
- Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003, 4: 249-264. 10.1093/biostatistics/4.2.249.View ArticlePubMedGoogle Scholar
- Cope LM, Irizarry RA, Jaffee HA, Wu ZJ, Speed TP: A benchmark for affymetrix GeneChip expression measures. Bioinformatics. 2004, 20: 323-331. 10.1093/bioinformatics/btg410.View ArticlePubMedGoogle Scholar
- Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge YC, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JYH, Zhang JH: Bioconductor: open software development for computational biology and bioinformatics. Genome Biology. 2004, 5:Google Scholar
- Li C, Wong WH: Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection. Proceedings of the National Academy of Sciences of the United States of America. 2001, 98: 31-36. 10.1073/pnas.011404098.PubMed CentralView ArticlePubMedGoogle Scholar
- Wu Z, Irizarry R, Gentleman R, Murillo F, Spencer F: A model based background adjustment for oligonucleotide expression arrays. 2004Google Scholar
- Zhou L, Rocke DM: An expression index for Affymetrix GeneChips based on the generalized logarithm. Bioinformatics. 2005, 21 (21): 3983-9. 10.1093/bioinformatics/bti665.View ArticlePubMedGoogle Scholar
- Dabney AR, Storey JD: A reanalysis of a published Affymetrix GeneChip control dataset. Genome Biology. 2006, 7:Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.