Fast functions for correlation and hierarchical clustering

Peter Langfelder (1) and Steve Horvath (1,2)

(1) Dept. of Human Genetics, UC Los Angeles; (2) Dept. of Biostatistics, UC Los Angeles

Peter (dot) Langfelder (at) gmail (dot) com, SHorvath (at) mednet (dot) ucla (dot) edu

Abstract

Many high-throughput biological data analyses require the calculation of large correlation matrices and/or clustering of a large number of objects. The standard R function for calculating Pearson correlation handles data without missing values efficiently, but is inefficient when applied to data sets with a relatively small number of missing values. We present an implementation of Pearson correlation calculation that can lead to substantial speedup on data with a relatively small number of missing entries. Further, we parallelize all calculations and thus achieve further speedup on systems where parallel processing is available. A robust correlation measure, the biweight midcorrelation, is implemented in a similar manner and provides comparable speed. The functions cor and bicor for fast Pearson and biweight midcorrelation, respectively, are part of the updated, freely available R package WGCNA. The hierarchical clustering algorithm implemented in the R function hclust is an order n^3 version (where n is the number of clustered objects) of a publicly available clustering algorithm by Fionn Murtagh (http://www.classification-society.org/csna/mda-sw/). We present the package flashClust, which implements the original algorithm and in practice achieves order approximately n^2, leading to substantial time savings when clustering large data sets.
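As a rough illustration of the correlation use case described above, the following sketch computes a pairwise-complete Pearson correlation matrix on simulated data with a small fraction of missing entries. The data dimensions, the 1% missing rate and the variable names are assumptions chosen for the example. The WGCNA calls run only if the package is installed and are meant as a usage sketch, not as a benchmark.

```r
# Simulated data (an assumption for illustration): 200 samples x 50 variables,
# with 100 of the 10000 entries (1%) set to missing.
set.seed(1)
x <- matrix(rnorm(200 * 50), nrow = 200, ncol = 50)
x[sample(length(x), 100)] <- NA

# Standard R: Pearson correlation between columns, using for each pair of
# columns only the rows where both values are present.
r_base <- stats::cor(x, use = "pairwise.complete.obs")

# If WGCNA is installed, its cor() accepts the same arguments, and bicor()
# computes the biweight midcorrelation on the same data.
if (requireNamespace("WGCNA", quietly = TRUE)) {
  r_fast  <- WGCNA::cor(x, use = "pairwise.complete.obs")
  r_bicor <- WGCNA::bicor(x, use = "pairwise.complete.obs")
  # Largest absolute difference between the two Pearson results.
  print(max(abs(r_fast - r_base)))
}
```

Wrapping both calls in system.time() on a larger matrix is one way to compare the speed of the two implementations on a given machine.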

Update (October 2014): The R core team recently modified the code of the standard function hclust in package stats. The new "standard" hclust is now as fast as or faster than the flashClust implementation presented here.
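The clustering workflow can be sketched as follows, assuming simulated data and average linkage (both choices are illustrative, not taken from the article). The standard hclust is always run. The flashClust call is executed only if the package is installed, and the comparison of merge heights is a simple sanity check rather than a timing test.

```r
# Simulated data (an assumption for illustration): 300 objects, 10 features.
set.seed(2)
x <- matrix(rnorm(300 * 10), nrow = 300)
d <- dist(x)  # Euclidean distances between the rows of x

# Standard hierarchical clustering from package stats.
tree_base <- stats::hclust(d, method = "average")

# If flashClust is installed, it takes the same distance object and method.
if (requireNamespace("flashClust", quietly = TRUE)) {
  tree_fast <- flashClust::flashClust(d, method = "average")
  # Check that both trees have the same merge heights.
  print(all.equal(sort(tree_base$height), sort(tree_fast$height)))
}

# Cut the tree into 5 clusters; labels[i] is the cluster of object i.
labels <- cutree(tree_base, k = 5)
```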

Article reference

Peter Langfelder and Steve Horvath, Fast R Functions for Robust Correlations and Hierarchical Clustering. Journal of Statistical Software 46(11), 1-17 (2012). http://www.jstatsoft.org/v46/i11

R Software

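Functions described here are part of two R packages: WGCNA (functions cor and bicor for fast Pearson and biweight midcorrelation) and flashClust (fast hierarchical clustering).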

Detailed descriptions and installation instructions are available on the linked pages dedicated to each package.

Example R code

On a separate page we provide the R code that we used to measure the performance of the functions presented here compared to the standard R functions.



