Data and Statistical R Code:
Integrated Weighted Gene Co-expression Network Analysis (IWGCNA) with an Application to Chronic Fatigue Syndrome

Correspondence: shorvath@mednet.ucla.edu

Methods: Integrated Weighted Gene Co-expression Network Analysis (IWGCNA), Weighted Gene Co-expression Network Analysis (WGCNA)


Here we provide statistical code and data for the paper:

Presson AP, Sobel EM, Papp JC, Suarez CJ, Whistler T, Rajeevan MS, Vernon SD, Horvath S (2008) Integrated weighted gene co-expression network analysis with an application to chronic fatigue syndrome. BMC Systems Biology 2:95.



Background: Systems biologic approaches such as Weighted Gene Co-expression Network Analysis (WGCNA) can effectively integrate gene expression and trait data to identify pathways and candidate biomarkers. Here we show that the additional inclusion of genetic marker data allows one to characterize network relationships as causal or reactive in a chronic fatigue syndrome (CFS) data set.

Results: We combine WGCNA with genetic marker data to identify a disease-related pathway and its causal drivers, an analysis which we refer to as "Integrated WGCNA" or IWGCNA. Specifically, we present the following IWGCNA approach: 1) construct a co-expression network, 2) identify trait-related modules within the network, 3) use a trait-related genetic marker to prioritize genes within the module, 4) apply an integrated gene screening strategy to identify candidate genes and 5) carry out causality testing to verify and/or prioritize results. By applying this strategy to a CFS data set consisting of microarray, SNP and clinical trait data, we identify a module of 299 highly correlated genes that is associated with CFS severity. Our integrated gene screening strategy results in 20 candidate genes. We show that our approach yields biologically interesting genes that function in the same pathway and are causal drivers for their parent module. We use a separate data set to replicate findings and use Ingenuity Pathways Analysis software to functionally annotate the candidate gene pathways.
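
The integrated screening in steps 3 and 4 can be sketched as follows. This is only an illustration in Python (the tutorial's own code is R) and is not the authors' implementation: it assumes a gene is kept when the absolute correlation of its expression with both the trait and the SNP marker meets a threshold in every sample subgroup. The function names, gene names, toy values and thresholds are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = math.sqrt(sxx * syy)
    return sxy / denom if denom > 0 else 0.0


def screen_genes(groups, min_trait_cor, min_snp_cor):
    """Return genes passing both correlation thresholds in every subgroup.

    groups maps a subgroup name to a dict with keys:
      "expr":     {gene name: expression values, one per sample}
      "severity": trait values, one per sample
      "snp":      marker genotypes coded 0/1/2, one per sample
    """
    candidates = None
    for data in groups.values():
        passing = {
            gene
            for gene, values in data["expr"].items()
            if abs(pearson(values, data["severity"])) >= min_trait_cor
            and abs(pearson(values, data["snp"])) >= min_snp_cor
        }
        # Intersect across subgroups so a gene must pass in all of them.
        candidates = passing if candidates is None else candidates & passing
    return sorted(candidates or [])


# Toy data: hypothetical values, not taken from the CFS data set.
toy_groups = {
    "males": {
        "expr": {
            "geneA": [1, 2, 3, 4, 5],
            "geneB": [3, 1, 4, 1, 5],
            "geneC": [2, 4, 6, 8, 10],
        },
        "severity": [1, 2, 3, 4, 5],
        "snp": [0, 0, 1, 1, 2],
    },
    "females": {
        "expr": {
            "geneA": [2, 1, 4, 3, 5],
            "geneB": [1, 3, 1, 4, 1],
            "geneC": [3, 1, 2, 3, 1],
        },
        "severity": [2, 1, 4, 3, 5],
        "snp": [0, 0, 1, 1, 2],
    },
}

print(screen_genes(toy_groups, min_trait_cor=0.5, min_snp_cor=0.5))
```

Requiring the thresholds to hold in each subgroup separately, rather than in the pooled samples, guards against a candidate that is driven by a single subgroup.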

Conclusions: We show how WGCNA can be combined with genetic marker data to identify disease-related pathways and the causal drivers within them. The systems genetics approach described here can easily be used to generate testable genetic hypotheses in other complex disease studies.

Slide presentation: IWGCNA_Nov2008.pdf.


Data, R Software Tutorials, and Analysis Outline (Last Updated: 5/11/10)

The chronic fatigue data were generated by the Centers for Disease Control and Prevention (CDC) and generously provided as a challenge data set to the 2006 Critical Assessment of Microarray Data Analysis (CAMDA) conference.

Data: The file CFS.Data.zip (11.2 MB) contains 6 data files: "Clinical_data_CFS.txt", "CFS_trait_legend.xls", "Expression_data_CFS.txt", "SNP_data_CFS.txt", "std-analysis-29-candidate-genes-IPA.txt" and "CFS_trait_data_127x47.txt".

Network & causality functions: The file IWGCNA_2010.zip (258 KB) contains 4 files with the R functions required for IWGCNA: "NetworkFunctions_Jan2010.txt", "neo.txt", "sma_package.txt", and "CausalityFunctions.txt". In response to valuable user input, these functions received minor updates on 1/26/10 relative to the original 2008 posting.

The tutorial for the CFS integrated weighted gene co-expression network analysis (IWGCNA) is available in both MS Word (CFS_Online_Tutorial_Jan2010.doc) and Adobe Acrobat (CFS_Online_Tutorial_Jan2010.pdf) formats. It contains all analyses described in our manuscript and was updated with minor changes on 1/26/10 in response to valuable user input (see the green bolded comments in the tutorials).

  1. Construction of a gene co-expression network
    1. Data pre-processing
      1. Code to remove outlying arrays
      2. Code to remove outlying genes
      3. Remove all arrays/samples relating to the intake classification control group (level 5), leaving 127 arrays
    2. Use soft thresholding to determine the power for transforming the correlation matrix into an adjacency matrix
    3. Reduce the 8966 gene set to a more manageable number, ~3000 genes, by discarding genes with low connectivity
    4. Create the adjacency and topological overlap matrices
    5. Use hierarchical clustering to define gene modules
    6. Check that these modules are legitimate using heat maps and multi-dimensional scaling plots
  2. Examining network properties
    1. Create data subsets
    2. Compute the SNP significance measure for each subgroup
    3. Compute the connectivity for each subgroup
    4. Construct correlation bar and scatter plots stratified by module to compare the male and female samples
  3. Gene screening strategy
    1. Examine quantiles of the connectivities and correlations between the gene expressions, severity and SNP data
    2. Screen for genes based on correlation thresholds imposed in both males and homogenized females
    3. The screening strategy results in 20 candidate genes
  4. Second data set results
    1. First check for outliers
    2. Compute the equivalent connectivities and correlations in the second data set
    3. Create a gene co-expression network based on the second data set samples and color the clustered genes by their definitions in the original (127 sample) data set
    4. Now check whether the same candidate genes are selected when a similar screening strategy is applied
  5. Summarizing the results in a table of correlations
  6. Causality analysis using LEO (single.marker.analysis)
    1. Calculate LEO.NB.SingleMarker scores for all genes in the candidate module using all samples with severity scores (87)
    2. Calculate LEO.NB.SingleMarker scores for all genes in the candidate module using male and homogenized females with severity scores (76)
  7. Standard analysis of trait and gene expression data (ignoring the SNP marker)
    1. Calculate p-values and q-values for the correlation between severity and the gene expression data, and select the 346 genes with the smallest q-values
    2. These 346 genes were analyzed using Ingenuity Pathways Analysis (IPA) software (August 2008) and the top network was selected, which consisted of 29 candidate genes
    3. Calculate correlations for the 29 candidate genes selected using IPA
      1. Calculate the correlations between these 29 candidate genes and severity
      2. Calculate their correlations with SNP12
      3. Calculate the ranks of their correlations with the blue module eigengene (out of 8966 genes)
    4. Extract these results for the 20 IWGCNA genes
    5. Compare the 20 IWGCNA results with the 29 standard analysis candidates
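
Steps 1.2 to 1.4 of the outline can be illustrated with a short sketch. The tutorial itself is written in R; the Python below is only an illustration and does not reproduce the functions in NetworkFunctions_Jan2010.txt. It assumes the unsigned WGCNA definitions of Zhang and Horvath (2005): adjacency a_ij = |cor(x_i, x_j)|^beta, connectivity k_i = sum of a_iu over u, and topological overlap (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij) with l_ij = sum of a_iu * a_uj over u. The function names, the toy expression values and beta = 6 are assumptions chosen for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = math.sqrt(sxx * syy)
    return sxy / denom if denom > 0 else 0.0


def adjacency(expr, beta):
    """Soft-threshold adjacency a_ij = |cor(x_i, x_j)|**beta with a zero diagonal.

    expr holds one expression profile (list of sample values) per gene.
    """
    g = len(expr)
    a = [[0.0] * g for _ in range(g)]
    for i in range(g):
        for j in range(i + 1, g):
            a[i][j] = a[j][i] = abs(pearson(expr[i], expr[j])) ** beta
    return a


def connectivity(a):
    """Connectivity k_i: sum of adjacencies between gene i and all other genes."""
    return [sum(row) for row in a]


def top_connected(a, n_keep):
    """Indices (ascending) of the n_keep genes with the highest connectivity."""
    k = connectivity(a)
    ranked = sorted(range(len(a)), key=lambda i: k[i], reverse=True)
    return sorted(ranked[:n_keep])


def topological_overlap(a):
    """Topological overlap matrix computed from an adjacency matrix."""
    g = len(a)
    k = connectivity(a)
    tom = [[1.0 if i == j else 0.0 for j in range(g)] for i in range(g)]
    for i in range(g):
        for j in range(i + 1, g):
            # The zero diagonal drops the u == i and u == j terms automatically.
            shared = sum(a[i][u] * a[u][j] for u in range(g))
            denom = min(k[i], k[j]) + 1.0 - a[i][j]
            tom[i][j] = tom[j][i] = (shared + a[i][j]) / denom
    return tom


# Toy profiles for three genes over four samples (hypothetical values).
toy_expr = [[1, 2, 3, 4], [2, 4, 6, 8], [1, -1, 1, -1]]
toy_adj = adjacency(toy_expr, beta=6)
print(top_connected(toy_adj, n_keep=2))
print(topological_overlap(toy_adj))
```

In this sketch, hierarchical clustering (step 1.5) would then operate on the dissimilarity 1 - TOM; that step is omitted here.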



For additional examples of weighted gene co-expression network analysis, see the Weighted Gene Co-Expression Network Page. The WGCNA method is described in Zhang and Horvath (2005); for a more detailed mathematical description, see Dong and Horvath (2007, 2008).

 


2009-05-27

Please send your suggestions and comments to: shorvath@mednet.ucla.edu