The functional capabilities of eUTOPIA are showcased here by processing a publicly available dataset GSE92900 [14] obtained from the GEO [15] repository. The phenotype table provided in Additional file 2 is used for sample annotations.
The quality control of the raw data is performed with the arrayQualityMetrics R package. The quality report represents the distribution of intensities in arrays, the comparison between arrays (Fig. 2), individual array quality, and it also reports outliers.
In Fig. 2a, array 10 has high distance measures with the rest of the arrays as represented by the bright and dull yellow colored cells. This distance information can also be interpreted from the hierarchical cluster in the right margin of Fig. 2a, where array 10 clusters separately from the others. Outliers are detected based on this distance information as arrays with exceptionally large distance to all other arrays. A summarized distance measure is determined for each array by summing up the distances to all other arrays, which is checked against an outlier threshold obtained on the basis of well-established interquartile range (IQR) rules defined by Tukey Fences. The outliers can be observed in Fig. 2b, where array 10 has a large summarized distance from the rest of the arrays and can be markedly recognized as an outlier.
Normalization is performed by using the quantile method for between-array normalization. The normalized data can be observed by the representation of expression values as box plot, density plot, and mean difference plot (MDplot) (Fig. 3).
The difference in the distribution of expression values in different arrays can be observed in the box plot before normalization (Fig. 3a), suggesting the need for adjustment of the distributions for fair comparison across the set of arrays. The distribution of expression values is harmonious across the arrays in the box plot after normalization with the quantile method (Fig. 3b).
The difference in the distribution of expression values from individual channels of different arrays can be observed from the smoothed curves in the density plot (Fig. 3c) before correction. Only a single smoothed curve is visible in the density plot after normalization (Fig. 3d) since the channels from all arrays have the same distribution of expression values as a result of quantile normalization.
The larger scale of log-2 ratios (M-values on the y-axis) observed in the MDplot before normalization (Fig. 3e) shows that the data points are farther away from zero log2 expression ratio, suggesting bias. The average log2 expression values also have a large scale (x-axis) before normalization. The smaller scale of M-value (y-axis) observed in the MDplot after normalization (Fig. 3f) shows that data points are much closer to zero log2 expression ratios as the bias has been adjusted. From this plot, it is also possible to appreciate how the average log2 expression values have much smaller scale (x-axis) after normalization.
The known batches in the data are identified by inferring the variation associated with the sample annotations from the prince plot and by observing the significance of interrelatedness between the sample annotations from the confounding plot. The ‘group’ annotation variable is considered as the variable of interest. The prince plot (Fig. 4a) displays that variables ‘slide’, ‘area’, ‘array’, ‘n.mice’, ‘operator’ and ‘date’ are associated with principal components representing high variation in the data. The confounding plot (Fig. 4b) suggests that the variable ‘RIN’ is confounded with the variable of interest ‘group’ and thus cannot be taken forward for any data correction. Also, variables ‘slide’, ‘area’, ‘operator’, and ‘date’ are found to be confounded with ‘array’. The ‘array’ variable represents the highest variability in the data and is favored over ‘slide’, ‘area’, ‘operator’, and ‘date’ as a batch variable. Thus, it can be inferred from these plots that the final set of variables for batch correction is ‘array’, ‘dye’ (standard known batch), and ‘n.mice’ (Fig. 4).
The effect of batch correction performed with eUTOPIA can be studied by comparing the variation information before and after correction. The first three principal components in the prince plot generated from data before batch correction (Fig. 5a) represent 28, 16, and 11% variation in the data. The variable of interest ‘group’ was significantly associated (p-value: 0.004) with only the second principal component, while the identified batch variable ‘array’ was significantly associated with all three principal components (p-values: 3e-06, 0.02, and 0.06, respectively) and ‘n.mice’ was significantly associated with the first principal component (p-value: 0.03). In the prince plot generated from the data after batch correction (Fig. 5b), the first three principal components represent 49, 23, and 11% variation in the data, which is a significant shift. The variable of interest ‘group’ is now observed to have a high association with all three principal components (p-values: 1e-13, 3e-14, and 7e-10, respectively). The batch variable ‘array’ has a comparably lesser significant association with the first principal component (p-value: 0.06) and batch variable ‘n.mice’ also has a comparably lesser association with the third principal component (p-value: 0.06). While batch variables ‘slide’, ‘area’, ‘operator’, ‘date’, and ‘dye’ are no longer associated with principal components representing high variation. Thus, allowing the user to check that the variation associated with the variable of interest is preserved while the noise associated with the batch variables has been corrected. The principal component analysis (PCA) plot generated from the data before batch correction (Fig. 5c) displays samples scattered across the projected components with no obvious grouping of samples by the variable of interest. The PCA plot (Fig. 5d) after the known batch correction displays a more discrete grouping of samples (smaller intragroup distances) and better separation of groups (intergroup distances) in the projected components.
The hidden sources of variation can be identified as surrogate variables by using sva method. The attributes of these surrogate variables must be inspected before taking any decisive action. These sources of hidden variation can represent technical batch information or biological information such as sub-types. Thus, care must be taken to choose the right candidate variables for batch correction. The sva analysis resulted in three surrogate variables that were significantly associated with the first three principal components representing high variation as observed in the prince plot (Fig. 6a). These identified surrogate variables show significant association with known batch variables ‘slide’, ‘area’, ‘array’, ‘n.mice’, ‘operator’, and ‘date’ (Fig. 6b). It is noticeable that none of these surrogate variables have a significant association with the ‘dye’ variable which can be explained by the fact that ‘dye’ variable is not associated with any principal component representing high variation in the prince plot generated from uncorrected data (Fig. 6a).
The discretized surrogate variables were associated with the first three principal components in the prince plot generated from data before batch correction (Fig. 7a). The variable of interest ‘group’ was significantly associated (p-value: 0.004) with the second principal component. While the surrogate variables ‘svaD.1’ was significantly associated with the first principal component (p-value: 6e-04), ‘svaD.2’ was significantly associated with the second and third principal components (p-value: 0.05, and 5e-05, respectively), and ‘svaD.3’ was significantly associated with the third principal component (p-value: 0.06). In the prince plot generated from the data after batch correction (Fig. 7b), the variable of interest ‘group’ is now observed to have a high association with the first principal component (p-value: 3e-07). The discretized surrogate variables ‘svaD.1’, ‘svaD.2’, and ‘svaD.3’ are no longer associated with the first three principal components representing high variation. The removal of hidden batch effects with surrogate variables removes artefactual technical variation from the data, thus revealing the true variation signal of the variable of interest. The PCA plot generated from the data before batch correction (Fig. 7c) displays samples scattered across the projected components with no obvious grouping of samples by the variable of interest ‘group’. The grouping of samples in the PCA from the corrected data (Fig. 7d) is more discrete than the uncorrected data, but the separation between the groups (intergroup distances) is not very clear, only ‘rCNT’ and ‘Ctrl’ show visibly distinct separation (Fig. 7d), while the samples from other groups are much closer. The tighter packed grouping of samples can be observed by the circular outlines drawn around the groups in Fig. 7d.
The differential expression analysis is performed by specifying the limma linear model parameters in the eUTOPIA interface. Here, the ‘group’ annotation variable is specified as the ‘variable of interest’. The known batches and the surrogate variables are instead specified as covariates for the corrected data. The contrasts are defined to compare each nanomaterial ‘Baytubes’, ‘Fullerene’, ‘GNF’ (Graphite Nano Fibers), ‘rCNT’, ‘SES’, and ‘tCNT’ to the control ‘Ctrl’ samples. The differential expression analysis results are filtered by logFC cutoff threshold 1 and P. Value cutoff threshold 0.05. Differential expression results are explored by the medium of plots (Fig. 8) to classify contrasts and conditions by expression patterns. The intersection of differentially expressed gene sets represented as an UpSet plot (Fig. 8a) can be used to infer that rCNT samples have overall larger variation contrast against the control (Ctrl) samples, while the rest of the nanomaterials have much smaller and comparable variation contrasts. There is higher overlap between rCNT and tCNT sets with larger intersection sizes and multiple intersections. Nanomaterial specific genes are substantial in rCNT, GNF, Baytubes, and tCNT while specific genes are not reported for SES and Fullerene because of their smaller values. The distribution of differential genes from each contrast is represented as a volcano plot (Fig. 8b) which can be used to infer that the genes in this particular contrast/comparison are more inclined towards overexpression with higher significance values while the underexpressed genes are fewer and have lower significance values. The expression pattern of top differentially expressed genes is represented as a Heatmap (Fig. 8c), which can be used to infer that there are two major parent clusters of samples. The first parent cluster contains a child cluster of Fullerene (hollow carbon sphere) and GNF (long rigid carbon fiber), which is further clustered with rCNT (long rigid multi-walled CNT). The second parent cluster contains two child clusters, control samples (Ctrl) cluster together with Baytube (short tangled multi-walled CNT) to form the first child cluster, SES (short rigid multi-walled CNT) and tCNT (long tangled multi-walled CNT) cluster together to form the second child cluster. The nanomaterials in the first parent cluster have a larger diameter and smaller surface area compared to the nanomaterials in the second parent cluster. The expression pattern of the selected top differentially expressed genes align very well with the nanomaterial intrinsic properties. The intrinsic properties of the nanomaterials are provided in Additional file 3.