Microarray analysis exercises

WIBR Microarray Analysis Course 2004

Starting Data     Processed Data

Introduction

You'll be using a sample of expression data from a study in which Affymetrix (one-color) arrays were hybridized to samples from fetal and adult human liver and brain tissue. Each hybridization was performed in duplicate. Many other tissues were also profiled but won't be used for these exercises.

What we'll be doing to analyze these data:

You'll be using Excel to do most of the mathematical analyses, since this will show the exact formulas used to perform every step of the analysis pipeline. As a result, you'll need to use Excel functions and be familiar with some Excel conventions. See the Excel help for the details.

Preliminary information: Image analysis and calculation of expression value

  1. As described in Su et al., 2002, human tissue samples were hybridized on Affymetrix (one-color) arrays and chips were scanned. For each tissue, at least two independent samples were hybridized to separate chips.
  2. Scanned images were quantified (including measurement of background) using standard software.
  3. Data from a probeset (a series of oligos designed to a specific gene target) were used to calculate an expression value for that probeset using standard Affymetrix algorithms.
  4. See the manuals from Affymetrix for more information about these processes, and the Statistical Algorithms Description Document for the actual equations used.
  5. Note that these analysis protocols are generally specific to the chip type and its manufacturer.

Class 1 exercises

Part I. Normalization of expression data

  1. Why? Chips may have been hybridized to different amounts of RNA, for different amounts of time, with different batches of solutions, etc. Normalization should make any comparisons between chips more meaningful.
  2. Download the starting data for the exercises.
  3. Look at the expression data for a selected series of experiments.
  4. Calculate the median of all expression values from each chip.
  5. To perform global median normalization, on a new sheet ("norm") of the same file, scale each column of data so that each median is 100.
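
The global median normalization in steps 4 and 5 can also be sketched outside Excel. The Python below is a minimal illustration, not the course's own worksheet: the chip names and values are made-up toy data, and `median_normalize` is a hypothetical helper name.

```python
from statistics import median


def median_normalize(chip_values, target=100.0):
    """Scale one chip's expression values so that their median equals `target`."""
    scale = target / median(chip_values)
    return [value * scale for value in chip_values]


# Hypothetical toy data: one list of expression values per chip (column).
chips = {
    "chip_A": [50.0, 200.0, 400.0, 800.0, 100.0],
    "chip_B": [30.0, 90.0, 150.0, 600.0, 60.0],
}

# Each chip is scaled independently, using its own median.
normalized = {name: median_normalize(values) for name, values in chips.items()}

for name, values in normalized.items():
    print(name, median(values))
```

Each chip gets its own scaling factor (100 divided by that chip's median), so every normalized column ends up with a median of 100 while the relative values within a chip are unchanged.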

Part II. Filtering of low intensity data

  1. Why? Since subsequent analysis will examine ratios of expression levels between two tissues, we don't want to spend our energy looking at expression intensities of genes that might change from close to zero to a larger intensity still close to zero (or vice versa).
  2. You have two choices for filtering low intensity data, so do either A or B:
    A. Using Affymetrix Absent/Present calls
    B. Dropping values similar to background [an alternative to method (A)]
  3. Using your chosen method, what fraction of genes on each chip are present?
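
Both filtering methods, and the "fraction present" question, can be sketched as follows. This is only an illustration under stated assumptions: the calls and values are toy data, the intensity floor of 20 is an arbitrary placeholder rather than a recommended cutoff, and counting only "P" calls as present (so that Marginal calls are dropped along with Absent ones) is one possible choice.

```python
def fraction_present_by_call(calls):
    """Method A: fraction of probesets whose detection call is Present ('P')."""
    return sum(1 for call in calls if call == "P") / len(calls)


def fraction_above_floor(values, floor):
    """Method B: fraction of probesets whose value exceeds an intensity floor."""
    return sum(1 for value in values if value > floor) / len(values)


# Hypothetical toy data for one chip: a detection call and a value per probeset.
calls = ["P", "A", "P", "M", "A"]
values = [250.0, 12.0, 90.0, 35.0, 8.0]

# Assumption: 20.0 stands in for "similar to background"; choose your own floor.
print(fraction_present_by_call(calls))
print(fraction_above_floor(values, floor=20.0))
```

The two methods need not agree on a given chip, which is worth keeping in mind when comparing your answer with someone who chose the other method.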

Class 2 exercises

Part III. Calculating ratios

  1. On a new sheet ("means"), calculate the mean of each pair of replicated experiments (converting 8 chips to 4 means).
  2. On a new sheet ("ratios"), calculate the ratios (for both brain and liver) of fetal tissue / adult tissue (converting 4 means to 2 ratios). Find log base 2 of each of these ratios, so positive log-transformed ratios show genes with higher expression in fetal tissue than in adult tissue.
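
As a small worked sketch of the two steps above (replicate means, then log base 2 of fetal / adult), using made-up values for three genes in brain. With fetal expression in the numerator, a log ratio above zero means higher expression in the fetal sample. The function names are hypothetical, and the values are assumed to be positive after normalization and filtering.

```python
from math import log2


def mean_of_replicates(rep1, rep2):
    """Average two replicate chips gene by gene."""
    return [(a + b) / 2 for a, b in zip(rep1, rep2)]


def log2_ratios(numerator, denominator):
    """Log base 2 of numerator / denominator, gene by gene (values must be > 0)."""
    return [log2(n / d) for n, d in zip(numerator, denominator)]


# Hypothetical normalized values for three genes, two replicate chips per tissue.
fetal_brain = mean_of_replicates([200.0, 50.0, 100.0], [600.0, 50.0, 100.0])
adult_brain = mean_of_replicates([100.0, 150.0, 100.0], [100.0, 250.0, 100.0])

# Fetal / adult: positive means higher in fetal, negative means higher in adult.
brain_log_ratios = log2_ratios(fetal_brain, adult_brain)
print(brain_log_ratios)
```

A four-fold increase and a four-fold decrease become +2 and -2 after the transform, which is why log ratios are easier to compare than raw ratios such as 4 and 0.25.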

Part IV. Identifying differentially expressed genes

  1. Differentially expressed genes can be naively determined by fold changes but more effectively determined by using a statistic such as the t test.
  2. We'll compare the results of these two methods later in Part VI.
  3. Use the t test with each gene to determine if the data on fetal and adult expression are different in the brain and/or liver.
  4. List all the gene IDs for those that meet your significance threshold (such as p < 0.01).
  5. Compile the mean expression values of all genes that show a significant change in expression (to use later for clustering) in the four tissues.
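
The per-gene t test and the thresholding step can be sketched as below. This is a minimal illustration under specific assumptions, not necessarily the test settings the course intends: it is a two-sample, equal-variance, two-tailed t test restricted to two replicates per group, which gives 2 degrees of freedom and lets the p-value be written in closed form. The gene IDs and values are invented, and `t_test_df2` is a hypothetical helper name. In Excel, the comparable calculation would be the TTEST function with two tails and the equal-variance type.

```python
from math import sqrt
from statistics import mean, variance


def t_test_df2(group1, group2):
    """Two-sample equal-variance t test for two duplicates per group.

    Returns (t, two_sided_p). Only valid for 2 + 2 observations (df = 2).
    """
    if len(group1) != 2 or len(group2) != 2:
        raise ValueError("this sketch only handles two replicates per group")
    pooled_var = (variance(group1) + variance(group2)) / 2
    if pooled_var == 0:
        raise ValueError("identical replicates: t statistic is undefined")
    # With n1 = n2 = 2, the standard error equals the pooled standard deviation.
    t = (mean(group1) - mean(group2)) / sqrt(pooled_var)
    # Closed-form two-sided p-value of the t distribution with 2 degrees of freedom.
    p = 1 - abs(t) / sqrt(2 + t * t)
    return t, p


# Hypothetical data: gene ID -> (fetal duplicates, adult duplicates).
genes = {
    "gene_1": ([7.0, 9.0], [1.0, 3.0]),
    "gene_2": ([5.0, 6.0], [5.5, 6.5]),
    "gene_3": ([10.0, 10.2], [2.0, 2.2]),
}

threshold = 0.01  # the example significance threshold from step 4
significant = [gene_id for gene_id, (fetal, adult) in genes.items()
               if t_test_df2(fetal, adult)[1] < threshold]
print(significant)
```

With only duplicates, a large difference between means is not enough on its own: the replicates must also agree closely for the p-value to fall below the threshold.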

Part V. Clustering

  1. Use any or all of these data sets. The third data set, which spans more tissues, may be the most interesting.
    1. your subset of expression values (from Part IV.5)
    2. a pre-processed set of expression values (not ratios)
    3. a full set of expression ratios (transformed to log base 2), with values compared to the mean across all tissues
  2. Open Cluster 3.0, a clustering application that works on all operating systems. It's an enhanced version of the Eisen clustering program. See the manual for more information about the program.
  3. File > Open and select your file of expression data (one of the files in Part V.1).
  4. Note that there are some filtering and normalization functions on the tabs "Filter Data" and "Adjust Data", but we've already performed these steps.
  5. Try Hierarchical clustering using the default settings.
  6. Try k-Means clustering using the default settings.
  7. Optional: While in JavaTreeView, try Export > Export to Postscript and save all or part of your figure. This will produce an image of optimal resolution. Otherwise, you may wish to export to GIF or bitmap (which are easier to handle in Photoshop, but lower resolution).
  8. Optional: Open the heatmap in Illustrator or Photoshop.
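
Clustering groups genes whose expression profiles are similar, so the choice of similarity measure matters. As a rough illustration, the sketch below computes a correlation-based distance (1 minus the Pearson correlation) between toy profiles. This is one common choice for expression data, shown here as an assumption; it is not a statement of which metric the clustering program's default settings use, and `pearson_distance` is a hypothetical helper.

```python
from math import sqrt


def pearson_distance(x, y):
    """Return 1 - Pearson correlation between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1 - covariance / (spread_x * spread_y)


# Hypothetical profiles across four tissues.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # same shape as gene_a, twice the magnitude
gene_c = [4.0, 3.0, 2.0, 1.0]   # opposite trend to gene_a

print(pearson_distance(gene_a, gene_b))  # close to 0: profiles move together
print(pearson_distance(gene_a, gene_c))  # close to 2: profiles move oppositely
```

Under this measure two genes with the same pattern but different overall magnitude are treated as near neighbors, which is often what is wanted when looking for co-regulated genes.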

Class 3 exercises

Part VI. Graphing all data

  1. You may try any combination of these graphing techniques. Global visualization of an experiment can be helpful for designing subsequent analysis and for quality control.
  2. Intensity scatterplot (sample)
  3. R-I (M-A) plot (sample)
  4. volcano plot (sample)

Part VII. Functional analysis

  1. Annotation (Excel and web tools)
  2. Comparing two lists
  3. Genome mapping of one gene or a set of genes
  4. Promoter extraction
  5. Identifying potential transcription factor binding sites with TRANSFAC
  6. Gene Ontology analysis
  7. Pathway analysis (KEGG)
  8. Motif finding (Meme)
  9. Comparisons to other expression data
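
For item 2 (comparing two lists), the comparison comes down to set operations on gene IDs. A minimal sketch with invented IDs and a hypothetical helper name:

```python
def compare_gene_lists(list_a, list_b):
    """Split two lists of gene IDs into shared and list-specific IDs."""
    set_a, set_b = set(list_a), set(list_b)
    return {
        "both": sorted(set_a & set_b),
        "only_a": sorted(set_a - set_b),
        "only_b": sorted(set_b - set_a),
    }


# Hypothetical lists, e.g. genes significant in brain and genes significant in liver.
brain_genes = ["gene_1", "gene_2", "gene_3"]
liver_genes = ["gene_2", "gene_3", "gene_4"]

comparison = compare_gene_lists(brain_genes, liver_genes)
print(comparison)
```

Converting to sets also removes duplicate IDs within a list, which matters when several probesets map to the same gene.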
