Unsupervised Learning with Random Forest Predictors

Tao Shi, Steve Horvath

Correspondence: shorvath@mednet.ucla.edu

Department of Human Genetics and Department of Biostatistics
University of California, Los Angeles, CA 90095


        A random forest (RF) predictor (Breiman 2001) is an ensemble of individual tree predictors. As part of their construction, RF predictors naturally lead to a dissimilarity measure between the observations. One can also define an RF dissimilarity measure between unlabelled data: the idea is to construct an RF predictor that distinguishes the "observed" data from suitably generated synthetic data (Breiman 2003). The observed data are the original unlabelled data, while the synthetic data are drawn from a reference distribution. Recently, RF dissimilarities have been used successfully in several unsupervised learning tasks involving genomic data. Unlike standard dissimilarities, the relationship between the RF dissimilarity and the variables can be difficult to disentangle. Here we describe the properties of the RF dissimilarity and make recommendations on how to use it in practice.
        An RF dissimilarity can be attractive because it handles mixed variable types well, is invariant to monotonic transformations of the input variables, is robust to outlying observations, and accommodates several strategies for dealing with missing data. The RF dissimilarity easily deals with a large number of variables due to its intrinsic variable selection; e.g., the Addcl1 RF dissimilarity weights the contribution of each variable to the dissimilarity according to how dependent it is on other variables.
        We find that the RF dissimilarity is useful for detecting tumor sample clusters on the basis of tumor marker expressions. In this application, biologically meaningful clusters can often be described with simple thresholding rules.
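The Addcl1 construction described above can be sketched in code. The sketch below uses scikit-learn's RandomForestClassifier as a stand-in for Breiman's original implementation; the function name `rf_dissimilarity` and its parameters are illustrative, not from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_dissimilarity(X, n_trees=500, random_state=0):
    """Sketch of the Addcl1 RF dissimilarity for unlabelled data X (n x p)."""
    rng = np.random.default_rng(random_state)
    n, p = X.shape
    # Addcl1 reference distribution: sample each variable independently
    # from its empirical marginal (this destroys the dependence structure).
    X_synth = np.column_stack([rng.choice(X[:, j], size=n) for j in range(p)])
    # Train an RF to distinguish observed (label 1) from synthetic (label 0).
    X_all = np.vstack([X, X_synth])
    y_all = np.concatenate([np.ones(n), np.zeros(n)])
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=random_state)
    rf.fit(X_all, y_all)
    # Proximity of two observed points: fraction of trees in which they
    # land in the same terminal node.
    leaves = rf.apply(X)  # shape (n, n_trees): terminal-node index per tree
    prox = np.zeros((n, n))
    for t in range(leaves.shape[1]):
        prox += leaves[:, t][:, None] == leaves[:, t][None, :]
    prox /= leaves.shape[1]
    # Dissimilarity transformation used by Shi and Horvath.
    return np.sqrt(1.0 - prox)
```

The resulting dissimilarity matrix can then be fed to any standard clustering procedure (e.g., partitioning around medoids or hierarchical clustering), as done in the tutorial.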

KEY WORDS: random forest clustering, biomarkers, ensemble predictors, random forest distance, random forest dissimilarity, tree predictor clustering


A technical report for random forest clustering can be found here

To cite the technical report, please use:

    Tao Shi and Steve Horvath (2006) Unsupervised Learning with Random Forest Predictors. Journal of Computational and Graphical Statistics. Volume 15, Number 1, March 2006, pp. 118-138.

For the journal article, click here


Word version

PDF version

TXT file version

R-functions used in the tutorial

Test data (comma delimited text file or Excel file)

Student Presentation

A student presentation for random forest clustering can be found here

Other Materials

The randomGLM predictor is an attractive alternative to the random forest. It is often more accurate and involves fewer covariates, as described here

Webpage

Random forest predictors can also be used for gene screening, as described here.

Read article 1 and article 2


Please send your suggestions and comments to: shorvath@mednet.ucla.edu