Gene Filtering with a Random Forest Predictor

Tao Shi, Steve Horvath



Department of Human Genetics and Department of Biostatistics

University of California, Los Angeles, CA 90095


Here we provide additional information on the Materials and Methods, as well as the statistical software code used for the random forest analysis, of the following article:

Mehrian-Shai R, Chen CD, Shi T, Horvath S, Nelson SF, Reichardt JKV, Sawyers CL (2007) IGFBP2 is a Biomarker for PTEN Status and PI3K/Akt Pathway Activation in Glioblastoma and Prostate Cancer. Proc Natl Acad Sci U S A. 2007 Mar 19.

For the journal article, click here

We use random forest predictors (Breiman 2001) to find genes that are associated with PTEN status in brain cancer (glioblastoma multiforme) and prostate tumors. In our data, we find that 10 probesets are associated with PTEN status irrespective of the tissue of origin. While our main analysis uses a random forest importance measure to implicate these 10 probesets, we show that they are also statistically significant according to a Kruskal-Wallis test or a Student t-test. We use supervised hierarchical clustering and a classical multi-dimensional scaling plot to visualize the relationships among the microarrays (patients).
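The general idea of importance-based gene filtering followed by univariate tests can be sketched as follows. This is not the authors' code and does not use their data: it is a minimal Python illustration on synthetic expression values, assuming NumPy, SciPy and scikit-learn are available. The function names, the data dimensions and the choice of scikit-learn's impurity-based importance are assumptions made for illustration; the importance measure used in the article may differ.

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestClassifier


def make_synthetic_expression(n_per_class=30, n_genes=200,
                              n_informative=10, shift=2.0, seed=0):
    """Simulate expression data (hypothetical, for illustration only).

    Rows are samples, columns are genes. The first `n_informative`
    genes are shifted upward in class 1; the rest are pure noise.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(2 * n_per_class, n_genes))
    y = np.repeat([0, 1], n_per_class)  # stand-in for a binary status label
    X[y == 1, :n_informative] += shift
    return X, y


def rank_genes(X, y, n_trees=500, seed=0):
    """Fit a random forest and rank genes by impurity-based importance."""
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(X, y)
    importance = forest.feature_importances_
    order = np.argsort(importance)[::-1]  # most important gene first
    return order, importance


def univariate_pvalues(X, y, gene):
    """Kruskal-Wallis and Welch t-test p-values for one gene."""
    a, b = X[y == 0, gene], X[y == 1, gene]
    kw_p = stats.kruskal(a, b).pvalue
    t_p = stats.ttest_ind(a, b, equal_var=False).pvalue
    return kw_p, t_p


X, y = make_synthetic_expression()
order, importance = rank_genes(X, y)
top10 = order[:10]

# Check the forest-selected genes with the two univariate tests.
for gene in top10:
    kw_p, t_p = univariate_pvalues(X, y, gene)
    print(f"gene {gene:3d}  importance={importance[gene]:.4f}  "
          f"Kruskal-Wallis p={kw_p:.2e}  t-test p={t_p:.2e}")
```

In this sketch the forest is used only to rank genes; the univariate tests are then applied to the top-ranked genes as a separate check, mirroring the order of the analysis described above. Because the same samples are used for ranking and testing, the resulting p-values should be read as descriptive rather than as an unbiased significance assessment.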


Supplement (Word version, PDF version)

Microarray data (Zipped Excel File)

Gene Summary data (Zipped Excel File)

Raw, unnormalized, Affymetrix .cel files (Zipped)


The randomGLM predictor is an attractive alternative to the random forest. It is often more accurate and involves fewer covariates, as described here:

Webpage

As an aside, we mention that random forest predictors can also be used for unsupervised learning (clustering). Read here.


Please send your suggestions and comments to: