A New Gene Selection Method for Visual Analysis
Jianping Zhou* and Georges Grinstein
Center for Biomolecular and Medical Informatics, University of Massachusetts Lowell, Lowell, MA 01854, USA
Abstract
Motivation: Microarray gene expression data present simultaneous measurements of the expression levels of thousands of genes. Selecting a small subset of significant and decisive genes is a necessary preprocessing step for identifying biologically interesting patterns and for sample classification. We introduce a new statistical supervised learning method, called the mean ratio method, for gene selection and use it in high dimensional visualizations for exploratory analysis and visual classification.
Results: We applied the mean ratio method to Khan’s Small Round Blue Cell Tumors gene expression data for gene selection, compared it to a number of techniques, and illustrated the comparisons visually in heatmaps and the classification visually in Radviz. Our study shows that the mean ratio method is cost effective and performs well among several common and relevant techniques.
Availability: The tool is implemented in Java within the Universal Visualization Platform (Gee et al. 2004) and is available on request for academic use and licensable for commercial use.
Contact: ,
Supplementary Information:
1 Introduction
Many clinical studies show that some general pathological diagnostic systems are not accurate enough for selecting the appropriate therapy type or treatment dose based on specific clinical categories. Microarray technology, in conjunction with advanced classification techniques, makes it possible to build more reliable disease diagnosis and prediction models, leading to suggestions about therapy response, prognosis, or metastasis phenotype.
The use of gene expression data usually involves two main steps: gene clustering and selection, and sample classification. The current computational approach to molecular analysis in cancer genomics is to identify gene patterns and statistically select the most appropriate genes using cluster analysis. Such sample classification leads to patient classification and diagnostic prediction. Since the number of dimensions (attributes) of the data is large, there are many computational dimensional reduction methods. However, these techniques are mostly concerned with the efficiency of the computation and the interpretability of the results.
Information visualization technology can achieve effectiveness, expressiveness, interaction and efficiency simultaneously. Visualization brings powerful human perception into the identification of valuable information hidden in data. However, very few visualizations can deal with the limitations of screen space.
Although Radviz (Hoffman et al. 1997) meets the need for high dimensional data presentation by being able to display thousands of dimensions, the order of the dimensions is significant for informative displays (Fotheringham 1999, Grinstein et al. 2002). To solve this problem, several dimension selection and dimension layout techniques have been developed (McCarthy et al. 2004, Leban et al. 2005).
In this paper we introduce a simple yet effective alternative technique, the mean ratio method, that addresses this problem, and we apply it to Khan’s Small Round Blue Cell Tumors (SRBCT) data for gene selection. To show the effectiveness of this method, we conduct comparison studies with several relevant techniques and use heatmaps for visual comparison and Radviz for exploratory analysis and visual classification.
1.1 Radviz Visualization, Dimension Selection and Dimensional Anchor Layout
Radviz (Hoffman et al. 1997) falls into the class of spring-force analysis models, as shown in Figure 1(a). Imagine springs connected from a record’s two-dimensional position to each of its coordinates (called dimensional anchors) on the boundary of the circle. The final projected position of the record is precisely where the sum of the forces is zero. Radviz projects data onto the 2D plane as shown in Figure 1(b). The position of a record in the projection is determined by its x and y coordinates, each computed from the record’s normalized values averaged over all dimensions.
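The zero-force condition described above can be written out explicitly. The following is a sketch in assumed notation (the symbols $v_j$, $\theta_j$ and $(x, y)$ are introduced here for illustration and may differ from the original typesetting): let $v_j$ be the record's normalized value in dimension $j$, acting as the stiffness of the spring attached to the anchor at angle $\theta_j$ on the unit circle. Setting the sum of the spring forces to zero then gives the projected position:

```latex
\sum_{j=1}^{d} v_j
\left[
\begin{pmatrix} \cos\theta_j \\ \sin\theta_j \end{pmatrix}
-
\begin{pmatrix} x \\ y \end{pmatrix}
\right] = 0
\quad\Longrightarrow\quad
x = \frac{\sum_{j=1}^{d} v_j \cos\theta_j}{\sum_{j=1}^{d} v_j},
\qquad
y = \frac{\sum_{j=1}^{d} v_j \sin\theta_j}{\sum_{j=1}^{d} v_j}.
```

Under this reading, each record lands at the weighted average of the anchor positions, with its own normalized values as the weights.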
Figure 1 Radviz spring force model and projection
Radviz has a number of features:
- Coordinates with higher values pull the projected data points closer toward the corresponding dimensional anchors. If a projected data point is close to a dimensional anchor, the value for this dimension is higher than for any other.
- Records with quite different coordinate values may be mapped onto the same point (e.g., (0, 0, …, 0) and (a, a, …, a) with a any real number).
- Points with roughly equal coordinate values lie close to the center.
- Points for which specific coordinates are large and close in value, and for which those coordinates are equally spread around the circumference of the circle are also near the center (they are pulled in close to equal opposing directions).
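The features listed above can be checked numerically. The following is a minimal sketch, assuming the standard weighted-average form of the Radviz projection with anchors evenly spaced on the unit circle and values already normalized to [0, 1]. The function name `radviz_project` and the convention of placing an all-zero record at the center are assumptions made for illustration, not part of the tool described in this paper.

```python
import numpy as np

def radviz_project(values, anchor_angles=None):
    """Project normalized records (n x d, values in [0, 1]) to 2D.

    Each record lands at the average of the anchor positions,
    weighted by the record's own values (zero net spring force).
    """
    values = np.asarray(values, dtype=float)
    n_dims = values.shape[1]
    if anchor_angles is None:
        # Assumption: anchors evenly spaced around the unit circle.
        anchor_angles = 2.0 * np.pi * np.arange(n_dims) / n_dims
    anchors = np.column_stack([np.cos(anchor_angles), np.sin(anchor_angles)])
    weights = values.sum(axis=1, keepdims=True)
    # Assumption: an all-zero record feels no pull, so keep it at the center.
    safe = np.where(weights == 0.0, 1.0, weights)
    return (values @ anchors) / safe

points = radviz_project([
    [0.2, 0.2, 0.2, 0.2],  # equal values
    [0.9, 0.9, 0.9, 0.9],  # different record, same projected point
    [1.0, 0.1, 0.1, 0.1],  # dominated by dimension 0
    [1.0, 0.0, 1.0, 0.0],  # large values on opposing anchors
])
print(np.round(points, 3))
```

With four anchors, the first two records coincide at the center, the third is pulled toward the anchor of dimension 0, and the fourth returns to the center because its two large coordinates sit on opposite sides of the circle.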
Thus Radviz has both strengths (it handles many dimensions and is intuitive) and weaknesses (it is lossy: different points can map to the same location). Radviz is able to display hundreds of thousands of dimensions (useful for chemical descriptors, for example) with no dimensional reduction or transformation. Its typical weakness is that an arbitrary layout of the dimensions around the circle will usually not show record distinctions and will likely not reveal hidden patterns, structures, trends, or outliers. Thus the dimensional anchors must be computationally laid out around the circle to display such patterns or structures.
Selecting the top ranked dimensions that contribute most to projection quality and properly laying out the corresponding dimensional anchors is the obvious solution. Once this is done, Radviz is the only visualization able to display millions of dimensions and one of the few successful ones able to perform visual classification and clustering. This is also meaningful from a domain perspective. For instance, as pointed out by Deutsch (2003), to classify samples in microarray gene expression data it is necessary to decide which genes should be included in a predictor. Having too many genes is not optimal, as many of the genes are largely irrelevant to the diagnosis and mostly have the effect of adding noise, decreasing the “information criterion”. This is particularly severe with a noisy data set and few subjects. Therefore an effort is made to choose an optimal set of genes with which to start the training of a predictor. Including too few dimensions will also not yield good projection quality and thus will not easily allow us to discriminate the classes of the data correctly. The most critical step is therefore the identification of the appropriate dimensions or genes.
Although a few statistical or feature selection techniques in machine learning can help in selecting dimensions for Radviz, systematic research and comparison studies for choosing the right technique are lacking. McCarthy et al. (2004) used a pairwise t-test-based class distinction algorithm to maximize the distance separating clusters or classes. This algorithm calculates the t-value, with an optional Bonferroni correction, for each class against all others for every dimension. The highest voted class, the one with the highest t-value, is assigned to the dimension. The highest voted class is also called the discrimination class for that dimension, meaning that this dimension (or coordinate) influences or discriminates members of this class more than those of any other. The t-value reflects the degree of influence. Once the relatively highly influenced dimensions are selected based on some criterion, the corresponding dimensional anchors are placed on the Radviz circle. The Radviz circumference is divided into equally sized arcs, one for each class, and the dimensions assigned to each class are evenly placed on its arc. For the selection criterion, McCarthy et al. used an algorithm called the Principal Uncorrelated Record Selection, which limits the amount of correlation among the selected dimensions, which in turn ensures that the major differences in the feature space are represented. This yields a subset of the original dimensions that captures the most relevant variability in the data and limits redundancy. Leban et al. (2005) used the ReliefF metric (Kononenko, 1994) to select the best dimensions for Radviz in the VizRank package. Based on ReliefF, VizRank defines a measurement, called the average of predicted probabilities, to evaluate the projection quality. This measurement favors projections in which the classes are well separated with little overlap.
Note that there are many other discriminant analysis methods which might benefit Radviz, for example several techniques we will mention in Section 3 of this paper. Ringnér, Peterson and Khan (2002) provided an overview of their applications, and Dudoit et al. (2000) compared different discrimination methods for tumor classification.
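The class-vote step and the arc-based anchor layout described above can be sketched as follows. This is an illustration under stated assumptions, not McCarthy et al.'s implementation: it uses a Welch-style one-vs-rest t-statistic, omits the optional Bonferroni correction, requires at least two records per class with non-zero variance, and centers anchors within equal slots of each class arc. The function names are hypothetical.

```python
import numpy as np

def vote_discrimination_class(values, labels):
    """For each dimension, return the class with the highest
    one-vs-rest t-value, together with that t-value."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    t = np.empty((len(classes), values.shape[1]))
    for c, name in enumerate(classes):
        inside = values[labels == name]
        outside = values[labels != name]
        # Assumption: Welch standard error (unequal variances).
        se = np.sqrt(inside.var(axis=0, ddof=1) / len(inside)
                     + outside.var(axis=0, ddof=1) / len(outside))
        t[c] = (inside.mean(axis=0) - outside.mean(axis=0)) / se
    return classes[t.argmax(axis=0)], t.max(axis=0)

def layout_anchors_by_class(dim_classes):
    """Give each class an equal arc of the circle and spread the
    dimensions assigned to that class evenly along its arc."""
    dim_classes = np.asarray(dim_classes)
    classes = np.unique(dim_classes)
    arc = 2.0 * np.pi / len(classes)
    angles = np.empty(len(dim_classes))
    for c, name in enumerate(classes):
        members = np.flatnonzero(dim_classes == name)
        # Assumption: anchors sit at the centers of equal slots in the arc.
        offsets = (np.arange(len(members)) + 0.5) / len(members)
        angles[members] = (c + offsets) * arc
    return angles

# Toy data: dimension 0 is high for class A, dimension 1 for class B.
values = [[5.0, 1.0], [6.0, 1.2], [1.0, 4.0],
          [1.2, 5.0], [1.1, 1.1], [0.9, 0.8]]
labels = ["A", "A", "B", "B", "C", "C"]
disc_class, t_value = vote_discrimination_class(values, labels)
angles = layout_anchors_by_class(disc_class)
print(disc_class, np.round(t_value, 2), np.round(angles, 3))
```

In this toy data, dimension 0 is voted to class A and dimension 1 to class B, so the two anchors end up in the middle of two opposite half-circle arcs.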
1.2 Mean Ratio Method
A good quality Radviz projection meets three requirements, similar to clustering requirements: close intra-class members, separated inter-class members, and radially spread-out record projections. As previously mentioned, to get a good quality Radviz projection, only a certain number of the top ranked dimensions need to be taken into account. Based on Radviz’s inherent features, these top ranked dimensions can be considered significant dimensions which characterize the data and can serve as the selected features for training a classifier. The mean ratio method is a supervised learning method and requires that the original dataset contain a class label dimension. If the dataset does not have this classification dimension or some other predetermined class assignment for each record, clustering or unsupervised learning can be used to discover it. However, the selection of appropriate techniques and configurations is beyond the scope of this paper.
Let $V$ be a set of records of dimension $d$ partitioned into $C$ disjoint subsets $V_1, \dots, V_C$, with $C > 1$, where $V_i$ corresponds to class label $i$ and $N_i$ is the size of $V_i$.
For each dimension $j$, $j = 1, \dots, d$, we define $M_j$ to be the maximum of $\{\mathrm{mean}\,\mathrm{dim}_j(V_1), \mathrm{mean}\,\mathrm{dim}_j(V_2), \dots, \mathrm{mean}\,\mathrm{dim}_j(V_C)\}$, where $\mathrm{dim}_j$ denotes the values of the record set in the $j$th dimension, which is locally normalized. We call the index $j_k$ where the maximum occurs the discrimination class label. If there is more than one, select the one with the lowest index.
For each dimension $j$ we define the mean ratio value $mr_j$:
For each dimension, the mean value of a subset is typically defined as the distance of the subset centroid to the origin. The mean ratio value measures the average degree of the distance difference between the discrimination class centroid and the other class centroids. Its value is between 0 and 1. The smaller the ratio value, the bigger the distance difference. For Radviz, the mean ratio value is thus a heuristic measure of how strongly the spring force pulls the record projection towards the dimensional anchor of the discrimination class. The mean ratio value can be used to rank dimensions and screen out the top ones. The smaller the mean ratio value, the stronger the pulling force toward the dimensional anchor of the highest voted class, or the closer the record projection is to that dimensional anchor. From a discriminant analysis point of view, the mean ratio is a discriminant score used to measure the statistical significance of dimensions. After selecting the top ranked dimensions, we group and lay out the selected dimensional anchors by class. The resulting projection separates and spreads out the desired classes in the radial direction. Note that the discrimination class assigned to each dimension is not a dimensional classification, which means we cannot rely solely on the highest dimensional value to classify a record into the discrimination class. In Radviz, the appropriate classification of a record depends on its projected position. The closer it is to a dimensional anchor, the more appropriate its classification into the discrimination class of the corresponding dimension.
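The description above can be turned into a small computation. The following is a minimal sketch under an assumed reading of the definition: for each dimension $j$, $mr_j$ is taken to be the average, over the $C - 1$ non-discrimination classes, of the ratio of that class's mean to the maximum class mean $M_j$. This reading satisfies the stated properties (a value between 0 and 1, smaller meaning stronger discrimination), but the exact formula, the min-max form of the local normalization, and the name `mean_ratio` are assumptions.

```python
import numpy as np

def mean_ratio(values, labels):
    """Return (mr, discrimination_class) for each dimension.

    Assumed reading: mr_j = (1 / (C - 1)) * sum over classes i other
    than the discrimination class of mean_j(V_i) / M_j.
    """
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    # Assumption: "locally normalized" means per-dimension min-max scaling.
    lo = values.min(axis=0)
    span = values.max(axis=0) - lo
    span[span == 0.0] = 1.0
    v = (values - lo) / span

    classes = np.unique(labels)
    means = np.vstack([v[labels == c].mean(axis=0) for c in classes])
    top = means.argmax(axis=0)        # ties resolve to the lowest class index
    m_max = means.max(axis=0)         # M_j
    others = means.sum(axis=0) - m_max
    denom = (len(classes) - 1) * np.where(m_max > 0.0, m_max, 1.0)
    # A constant dimension discriminates nothing, so give it the worst score.
    mr = np.where(m_max > 0.0, others / denom, 1.0)
    return mr, classes[top]

# Toy data: dimension 0 is high only in class A, dimension 1 is uninformative.
values = [[10, 5], [9, 6], [1, 5], [2, 6], [1, 5], [2, 6]]
labels = ["A", "A", "B", "B", "C", "C"]
mr, disc_class = mean_ratio(values, labels)
print(np.round(mr, 4), disc_class)
```

In this toy example the first dimension gets a small mean ratio (1/17) with discrimination class A, while the second, whose class means are identical, gets the maximum value of 1.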
There are two types of dimensional selection criteria: a rank order limit number and a global threshold. The rank order limit number criterion limits the number of selected dimensions for each class to at most this number. If the number of dimensions in a certain discrimination class category is less than the criterion number, all dimensions in this category are selected. The global threshold criterion sets a global cutoff across all class categories. Regardless of discrimination class, only dimensions that are above the cutoff are selected. The global threshold criterion can be either a percentage or a valid mean ratio value between 0 and 1. In our experiments with many different datasets, the rank order criterion usually performs better than the global threshold, as most real world datasets do not have balanced dimensional discrimination for each class. The dominant dimensions which discriminate some of the classes suppress the others, thus hiding information. To avoid this situation until more is understood about the data, the rank order criterion is the default option in our implementation.
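The two criteria can be contrasted on a small example. This is a sketch only: the function names are hypothetical, and "above the cutoff" is interpreted here as "at least as significant as the cutoff", that is, a mean ratio value no larger than the threshold, since smaller mean ratio values rank higher.

```python
import numpy as np

def select_by_rank_limit(mr, disc_class, limit):
    """Keep up to `limit` dimensions per discrimination class,
    choosing those with the smallest mean ratio values."""
    mr = np.asarray(mr, dtype=float)
    disc_class = np.asarray(disc_class)
    selected = []
    for name in np.unique(disc_class):
        idx = np.flatnonzero(disc_class == name)
        best = idx[np.argsort(mr[idx], kind="stable")][:limit]
        selected.extend(best.tolist())
    return selected

def select_by_threshold(mr, cutoff):
    """Keep every dimension whose mean ratio is at most `cutoff`,
    regardless of its discrimination class (assumed interpretation)."""
    mr = np.asarray(mr, dtype=float)
    return np.flatnonzero(mr <= cutoff).tolist()

# Hypothetical scores for six dimensions voted to three classes.
mr = [0.05, 0.30, 0.10, 0.08, 0.90, 0.20]
disc_class = ["A", "A", "A", "B", "B", "C"]

by_rank = select_by_rank_limit(mr, disc_class, limit=2)
by_cutoff = select_by_threshold(mr, cutoff=0.15)
print(by_rank)    # every class keeps at least one dimension
print(by_cutoff)  # class C has no dimension under the cutoff
```

In this example the global threshold drops class C entirely because its only dimension scores 0.20, whereas the rank order limit keeps it, which illustrates the suppression effect described above.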
An important characteristic of the mean ratio value lies in its independence from variance. Although taking variance into account is a common way to reduce the impact of noise, e.g. in the t-value and F-value formulas, these computations suppress outliers. Since identification of outliers is often an important function in certain visualization applications, we assume the data set is appropriately pre-cleaned and thus accept this limitation.
The mean ratio method is thus computationally simpler than the t-test. Unlike the t-test and other dimensional selection techniques, the mean ratio method has one other limitation: it requires the dimensions to be numerical and larger values to be more significant than smaller ones. For certain applications this limitation is not significant, as with microarray gene expression data, where the focus is on genes whose expression levels are high. For general datasets this limitation is problematic, since the record values for discrimination class members may not be high.
2 Results
2.1 Comparison of Mean Ratio Method with Others
We use Khan’s Small Round Blue Cell Tumors (SRBCT) gene expression data (Khan et al., 2001, supplement) to show the effectiveness of the mean ratio method for gene expression data and compare its results with those of the t-statistic method and five other techniques. SRBCT comprises four distinct diagnostic categories of childhood tumors: NB (neuroblastoma), RMS (rhabdomyosarcoma), BL (Burkitt lymphoma, a subset of non-Hodgkin lymphoma), and EWS (the Ewing family of tumors). The original dataset contains 2308 genes, 63 training samples including both tumor biopsy materials (13 EWS and 10 RMS) and cell lines (10 EWS, 10 RMS, 12 NB and 8 BL), and 25 test samples. The test samples consist of 14 SRBCT tumors (5 EWS, 5 RMS and 4 NB), 6 SRBCT cell lines (1 EWS, 2 NB and 3 BL), 2 normal muscle tissues, 1 undifferentiated sarcoma, 1 osteosarcoma and 1 prostate carcinoma. The last 5 are categorized as non-SRBCT samples.
Figures 2(a) and 2(b) are the Radviz projections using the mean ratio and the t-statistic methods respectively. The dimensional anchor label is the selected gene dimension ID; the color maps to the class name; and the dot size of the dimensional anchor indicates the degree of significance: the lower the mean ratio value or the higher the t-value, the more significant the dimension.
Figure 2 Radviz projections with a rank order limit number of 5 for Khan’s SRBCT dataset containing 63 controls
From the Radviz projections and the information shown in Table 1, we see that both techniques perform well in separating the four classes, but many of the selected dimensions (genes) are different. We also collected the results from five other published studies of the Khan SRBCT data and gathered gene functionality from a literature search. Each of the five studies uses a different classification technique and selects its own significant genes, with discrimination class labels, for classifier training. The collected results can be found in our supplementary information folder.
Table 1. Genes selected by using the mean ratio and the t-statistic methods with a rank order limit of 5 for each class
Gene ID / Gene Name / Class¹ / MR Value / T Value / P Value / F Rank² / ANN Rank³ / Selected in MR⁴ / Selected in T⁴ / Status⁵
241412 / ELF1 / BL / 0.0944 / 11.98 / 1.90E-17 / 12 / 58 / x / x / -
183337 / HLA DMA / BL / 0.1127 / 13.46 / 1.23E-19 / 4 / 23 / x / x / x
80109 / HLA DQA1 / BL / 0.1025 / 7.37 / 6.46E-10 / 46 / 13 / x / - / -
740604 / ISG20 / BL / 0.1111 / 8.40 / 1.14E-11 / 29 / >96 / x / - / x
767183 / HCLS1 / BL / 0.1198 / 9.74 / 6.95E-14 / 13 / 70 / x / - / -
236282 / WAS / BL / 0.1860 / 15.31 / 3.26E-22 / 1 / >96 / - / x / x
47475 / PIR121 / BL / 0.2294 / 12.04 / 1.56E-17 / 9 / >96 / - / x / -
624360 / PSMB8 / BL / 0.2336 / 11.95 / 2.18E-17 / 6 / >96 / - / x / -
377461 / CAV1 / EWS / 0.0951 / 11.25 / 2.59E-16 / 11 / 18 / x / x / x
770394 / FCGRT / EWS / 0.0975 / 14.75 / 1.88E-21 / 2 / 6 / x / x / -
43733 / GYG2 / EWS / 0.0636 / 5.90 / 1.87E-07 / 155 / 9 / x / - / x
866702 / PTPN13 / EWS / 0.0684 / 8.32 / 1.57E-11 / 40 / 15 / x / - / -
357031 / TNFAIP6 / EWS / 0.0787 / 6.26 / 4.73E-08 / 154 / 16 / x / - / -
1435862 / MIC2 / EWS / 0.1536 / 11.04 / 5.60E-16 / 10 / 73 / - / x / x
814260 / FVT1 / EWS / 0.1272 / 10.35 / 6.92E-15 / 18 / 75 / - / x / x
491565 / CITED2 / EWS / 0.3308 / 8.38 / 1.24E-11 / 34 / >96 / - / x / -
325182 / CDH2 / NB / 0.1416 / 10.26 / 9.79E-15 / 17 / 72 / x / x / -
812105 / AF1Q / NB / 0.1478 / 12.27 / 7.05E-18 / 3 / 22 / x / x / x
44563 / GAP43 / NB / 0.0825 / 6.80 / 5.97E-09 / 93 / 31 / x / - / -
135688 / GATA2 / NB / 0.1234 / 6.37 / 3.13E-08 / 96 / 47 / x / - / -
629896 / MAP1B / NB / 0.1831 / 8.64 / 4.58E-12 / 20 / 11 / x / - / x
383188 / RCV1 / NB / 0.2495 / 9.52 / 1.58E-13 / 16 / 29 / - / x / -
134748 / GCSH / NB / 0.3697 / 8.90 / 1.65E-12 / 23 / >96 / - / x / -
81518 / OCRL / NB / 0.3510 / 8.74 / 3.09E-12 / 19 / >96 / - / x / -
244618 / EST / RMS / 0.0762 / 8.27 / 1.88E-11 / 37 / 7 / x / x / x
784224 / FGFR4 / RMS / 0.0812 / 11.55 / 8.79E-17 / 7 / 68 / x / x / x
461425 / MYL4 / RMS / 0.0453 / 5.24 / 2.26E-06 / 219 / 42 / x / - / x
298062 / TNNT2 / RMS / 0.0667 / 6.63 / 1.14E-08 / 138 / 25 / x / - / x
245330 / IGF2 / RMS / 0.0992 / 3.19 / 2.29E-03 / >818 / 46 / x / - / x
796258 / SGCA / RMS / 0.1458 / 9.75 / 6.65E-14 / 21 / 89 / - / x / x
769716 / NF2 / RMS / 0.1716 / 8.60 / 5.29E-12 / 48 / >96 / - / x / -
839552 / NCOA1 / RMS / 0.2138 / 8.11 / 3.62E-11 / 53 / >96 / - / x / -
1. Same highest vote or discrimination class by means of the mean ratio, t-statistic and ANN methods
2. Calculated by using ChipST2C software (BCM Peterson Software Lab.)
3. Collected from Khan’s experiment (Khan et al., 2001)
4. Both experiments, with the mean ratio and t-statistic methods, are configured with a rank order limit number of 5 for each class
5. Indication of whether the gene is related to SRBCT or to a certain type of cancer or tumor, based on a literature search (Khan et al., 2001; Bicciato et al., 2003; Wan, 2004; Weber et al., 2004; Dua et al., 2001; Urashima et al., 1999)
From Table 1 we see that all but one of the dimensions selected by the mean ratio method are within the top 96 ANN ranks, whereas only about 2/3 of those selected by the t-statistic method are within that range. Several genes selected by the mean ratio method have high p-values and F ranks, and thus will not pass the t-test, since a high p-value indicates less significance. But GeneID 245330, as well as its two clones GeneID 296448 and 207274, denoting Gene IGF2, has been reported to relate to a wide range of cancers, including RMS tumors (Bicciato et al., 2003; Wan, 2004). The collected information shows that 4 out of 5 techniques identify its significance to SRBCT tumors. ANN even ranks GeneID 296448 and 207274 as the number 1 and 2 most important genes. This result again suggests that the t-test does not always work well. Conversely, some genes selected by the t-test which do show connections with certain types of cancer or tumor in publications are not included among the genes selected by the mean ratio method. For example, GeneID 236282, denoting Gene WAS, is the most significant gene in both the t and F tests, and its discrimination class is identified as BL by the t-test. But neither the mean ratio method nor any of the other surveyed methods identified it within the top 5 ranks in BL (the mean ratio method ranks it 10th in BL). Another interesting discovery is related to Gene MIC2 (GeneID 1435862). Khan et al. (2001) reported MIC2 to be sensitive to EWS, but noted that it cannot be used alone to discriminate EWS, as it was also expressed in several RMS samples. This explains why MIC2 is selected by the t-test but not by the mean ratio method with the top 5 rank order limit criterion (the mean ratio method ranks it 8th in EWS). Since the mean ratio method emphasizes average class discrimination, whereas the t-test measures statistical significance, the above difference is not a surprise. In fact, different dimensional selection techniques have different working mechanisms and unique characteristics.
In data mining, we know that no single method is able to discover all the features of the data, nor does any single method apply to all types of datasets (Hoffman et al., 1997). Thus the guiding principle for discovery should be that, even though different techniques derive different results, we should not exclude any single result unless we have additional knowledge coming from the literature, other algorithms, or domain expertise.