Random Forest Clustering Applied to Tissue Microarray Data

 

Tao Shi, Steve Horvath

(http://www.ph.ucla.edu/biostat/people/horvath.htm)

 

Department of Human Genetics and Department of Biostatistics

University of California, Los Angeles, CA 90095

 

Here we provide R code and data underlying the following article:

Shi T, Seligson D, Belldegrun AS, Palotie A, Horvath S. (2005) Tumor classification by tissue microarray profiling: random forest clustering applied to renal cell carcinoma. Mod Pathol. 2005 Apr;18(4):547-57


 

ABSTRACT

 

We describe a novel strategy (random forest clustering) for tumor profiling based on tissue microarray data. Random forest clustering is attractive for tissue microarray and other immunohistochemistry data since it handles highly skewed tumor marker expressions well and weighs the contribution of each marker according to its relatedness to the other tumor markers. Since the procedure is unsupervised, no clinicopathological data or traditional classifications are used a priori. To facilitate unsupervised learning, an intrinsic dissimilarity measure between the patients was constructed with a random forest analysis of the tumor markers. A technical report describing random forests is available separately.
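The construction of the dissimilarity can be illustrated with a short R sketch. This is not the code from the tutorial. It assumes the approach described in Shi and Horvath (2006): synthetic observations are drawn by sampling each marker independently from its observed values, a random forest is trained to separate observed from synthetic observations, and the proximity between observed samples is converted into a dissimilarity as sqrt(1 - proximity). The simulated data and all variable names (datRF, synthetic, distRF) are placeholders chosen for illustration.

```r
# Sketch only: RF dissimilarity on simulated data (not the renal cancer data).
library(randomForest)

set.seed(1)

# Simulated "marker" data: 60 samples, 4 skewed markers, two latent groups.
n <- 60
group <- rep(c(0, 1), each = n / 2)
datRF <- data.frame(
  m1 = rexp(n) + 2 * group,
  m2 = rexp(n) + 2 * group,
  m3 = rexp(n),
  m4 = rexp(n)
)

# Synthetic data: sample each marker independently from its observed values,
# which removes the dependence between markers.
synthetic <- as.data.frame(lapply(datRF, function(x) sample(x, replace = TRUE)))

# Label observed (1) versus synthetic (2) rows and let a forest separate them.
combined <- rbind(datRF, synthetic)
label <- factor(rep(c(1, 2), each = n))
rf <- randomForest(x = combined, y = label, ntree = 500, proximity = TRUE)

# Keep the proximities among the observed samples only and
# convert them into a dissimilarity (assumed form: sqrt(1 - proximity)).
prox <- rf$proximity[1:n, 1:n]
distRF <- sqrt(1 - prox)
```

The resulting matrix distRF can be passed to any clustering method that accepts a dissimilarity. Because the synthetic data and the forest are both random, results vary between runs unless the seed is fixed.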

 

The RF clustering algorithm has recently been shown to be particularly suitable for Tissue Microarray (TMA) data for the following reasons. First, the clustering results do not change when one or more covariates are monotonically transformed since the dissimilarity only depends on the feature ranks, obviating the need for symmetrizing skewed covariate distributions. Second, the RF dissimilarity weighs the contribution of each covariate to the dissimilarity in a natural way: the more related the covariate is to other covariates, the more it will affect the definition of the RF dissimilarity. Third, the RF dissimilarity does not require the user to specify threshold values for dichotomizing tumor expressions. External threshold values for dichotomizing expressions in unsupervised analyses may reduce the information content or even bias the results. We also compared the random forest clustering approach to the standard Euclidean distance based approach. Although there is good overlap between the two algorithms, we find that the random forest clustering method works better for these data (see the supplementary information for Shi et al. 2004). To visualize the tumor samples, we used classical multidimensional scaling, which takes as input the random forest dissimilarity between the samples and returns a set of points in a two-dimensional space such that the distances between the points are approximately equal to the original dissimilarities.
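The visualization step can be tried in base R with cmdscale, which performs classical multidimensional scaling. The sketch below is only an illustration: it uses simulated points and a Euclidean distance as a stand-in for the random forest dissimilarity, and the variable names (x, d_orig, coords) are placeholders.

```r
# Sketch only: classical MDS on a stand-in dissimilarity (base R, no packages).
set.seed(1)
n <- 30

# Simulated samples: two high-variance dimensions plus three low-variance ones.
x <- cbind(matrix(rnorm(n * 2, sd = 3), n, 2),
           matrix(rnorm(n * 3, sd = 0.3), n, 3))

# Stand-in dissimilarity between samples (Euclidean here, for illustration).
d_orig <- dist(x)

# Classical MDS: one point per sample in a two-dimensional space.
coords <- cmdscale(d_orig, k = 2)

# Compare distances between the MDS points with the original dissimilarities.
d_mds <- dist(coords)
agreement <- cor(as.vector(d_orig), as.vector(d_mds))

plot(coords, xlab = "Scaling dimension 1", ylab = "Scaling dimension 2")
```

The correlation stored in agreement gives a rough indication of how well the two-dimensional picture preserves the original dissimilarities.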

 

Below we list an R tutorial and a sample data set involving 307 tumor samples and 8 tumor markers. The data were generated by David Seligson from the UCLA tissue array core (http://www.genetics.ucla.edu/tissuearray/).

 

 

R SOFTWARE TUTORIAL: RFclustering applied to Renal Cancer
 

    Microsoft Word version

    PDF version

 

 

DEMO CODE

 

1) To install the R software, go to http://www.R-project.org

2) After installing R, you need to install two additional R packages: randomForest and Hmisc.

Open R and go to the menu "Packages\Install package(s) from CRAN", then choose randomForest. R will install the package automatically. When asked "Delete downloaded files (y/N)? ", answer "y". Do the same for Hmisc.

3) Download the zip file containing:

a) R function file: "FunctionsRFclustering.txt", which contains several R functions needed for RF clustering and results assessment

b) A test data file: "testData.csv"

c) MDS coordinate file: "cmd1.csv"

d) The tutorial file: "RFclusteringTutorial.txt"

4) Unzip all the files into the same directory, for example "C:\temp\RFclustering".

5) Open the R software by double clicking its icon.

6) Open the tutorial file "RFclusteringTutorial.txt" in a text editor, e.g. Notepad or Microsoft Word

7) Copy and paste the R commands from the tutorial into the R session. Comments are preceded by "#" and are automatically ignored by R.
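For readers who prefer typing commands over using the menus, the setup steps above can be sketched in R as follows. This is a minimal sketch under stated assumptions: the directory is the example from step 4, the object name dat1 is a placeholder, and testData.csv is assumed to have a header row. The tutorial file remains the authoritative source for the actual analysis commands.

```r
# Sketch only: command-line equivalent of the setup steps.

# Step 2 (one time only): install the two additional packages from CRAN.
install.packages(c("randomForest", "Hmisc"))

# Load the packages at the start of each R session.
library(randomForest)
library(Hmisc)

# Step 4: point R at the directory holding the unzipped files.
# Forward slashes avoid having to escape backslashes in R strings.
setwd("C:/temp/RFclustering")

# Load the helper functions shipped with the zip file.
source("FunctionsRFclustering.txt")

# Read the test data (a header row is assumed here) and inspect it.
dat1 <- read.csv("testData.csv", header = TRUE)
dim(dat1)
head(dat1)
```

After this, the commands in "RFclusteringTutorial.txt" can be pasted into the same session as described in step 7.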

 

 

REFERENCES
 

The following article describes theoretical studies of RF clustering.

  • Tao Shi and Steve Horvath (2006) Unsupervised Learning with Random Forest Predictors. Journal of Computational and Graphical Statistics. Volume 15, Number 1, March 2006, pp. 118-138

General intro to random forest

The following reference describes the R implementation of random forests

  • Liaw A. and Wiener M. Classification and Regression by randomForest. R News, 2(3):18-22, December 2002.


2007-02-27

Please send your suggestions and comments to: shorvath@mednet.ucla.edu