Introduction
You'll be using a sample of expression data from a study in which Affymetrix (one-color) U95A arrays were hybridized to RNA from fetal and adult human liver and brain tissue.
Each hybridization was performed in duplicate.
Many other tissues were also profiled but won't be used for these exercises.
What we'll be doing to analyze these data:
- normalize data from these eight chips
- calculate expression ratios of genes between two different tissues
- use a common statistical test to identify differentially expressed genes
- flag low intensity data (most probably background noise)
- cluster a differentially expressed subset of all genes to identify those with similar expression profiles
- try to find what functions specific groups of genes (with similar expression profiles) have in common
You'll be using Excel to do most of the mathematical analyses, since this will show the exact formulas used to perform every step of the analysis pipeline.
As a result, you'll need to use Excel functions and be familiar with some Excel conventions. See the Excel help for the details.
Preliminary information: Image analysis and calculation of expression value
- As described in Su et al., 2002,
human tissue samples were hybridized on Affymetrix (one-color) arrays and chips were scanned.
For each tissue, at least two independent samples were hybridized to separate chips.
- Scanned images were quantified (including measurement of background) using standard software.
- Data from a probeset (a series of oligos designed to a specific gene target) were used to calculate an expression value for that probeset
using standard Affymetrix algorithms.
- See the manuals from Affymetrix for more information about these processes,
and the Statistical Algorithms Description Document
for the actual equations used.
- Note that these analysis protocols are generally specific to the chip type and its manufacturer.
Class 1 exercises
Part I. Normalization of expression data
- Why normalize? Chips may have been hybridized to different amounts of RNA, for different amounts of time, with different batches of solutions, etc.
Normalization should remove systematic biases and make any comparisons between chips more meaningful.
- Download the starting data for the exercises.
- Look at the expression data for a selected series of experiments in the "raw" sheet.
- Down the first column are the Affymetrix probeset IDs, each corresponding to a target gene.
- Each other column shows a tissue name, followed by expression values in some arbitrary unit.
- There should be two chips each for adult and fetal brain and liver (a total of 8 chips).
- Note that since this is a one-color array, the expression measures are absolute values and not ratios.
If you're more used to 2-color (cDNA) arrays, this would be similar to (but not exactly the same as) 4 chips, each with one channel/dye for fetal tissue and another for adult tissue.
- Calculate the trimmed mean of all expression values from each chip.
- A trimmed mean discards a fixed fraction of the highest and lowest values and averages the rest, so it behaves like a compromise between the mean and the median of the set of values.
- To remove the top and bottom 2% of values, and find the mean of the remaining values,
we need to trim 4% of values.
Ex: =TRIMMEAN(B2:B12627, 0.04)
- Note: the Excel functions "MEDIAN" and "AVERAGE" can be used similarly to calculate the median or mean for all values in a chip.
- To perform global normalization by trimmed means, on a new sheet ("norm") of the same file, scale each column of data so that each trimmed mean is 100.
- Divide each expression signal in a chip by the trimmed mean for the chip and multiply by 100. Ex: =(raw!B2/raw!B$12628)*100 [if B12628 contains the trimmed mean for the chip]
- To confirm that this was correctly done, calculate the trimmed mean of each normalized chip.
- Global mean normalization is also possible but more influenced by outliers.
If you want to compare trimmed mean, mean, and median yourself [after the class], try these methods and then compare the results with scatterplots.
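For readers who want to check the spreadsheet results outside Excel, the trimmed-mean normalization above can be sketched in Python. This is a minimal sketch, not part of the course materials: the function names and the toy data are hypothetical, and the trimming rule (the number of dropped points is rounded down to a multiple of 2) follows the documented behavior of Excel's TRIMMEAN.

```python
def trimmed_mean(values, percent):
    """Mean after dropping `percent` of the points, split evenly between both ends.

    Assumption: like Excel's TRIMMEAN, the number of dropped points is
    rounded down to the nearest multiple of 2.
    """
    ordered = sorted(values)
    n = len(ordered)
    drop_each_end = int(n * percent) // 2
    kept = ordered[drop_each_end:n - drop_each_end]
    return sum(kept) / len(kept)


def normalize_chip(values, percent=0.04, target=100.0):
    """Scale one chip so that its trimmed mean equals `target`."""
    scale = target / trimmed_mean(values, percent)
    return [v * scale for v in values]


# Hypothetical toy chip: 49 made-up signals plus one extreme outlier.
raw_chip = [float(v) for v in range(10, 500, 10)] + [50000.0]
norm_chip = normalize_chip(raw_chip)

# The trimmed mean of the normalized chip should come out at (almost exactly) 100,
# which is the same check the exercise asks you to do in the "norm" sheet.
print(trimmed_mean(norm_chip, 0.04))
```

Because the outlier is trimmed before the scale factor is computed, it has no effect on the normalization, which is the reason the exercise prefers the trimmed mean to the plain mean.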
Part II. Calculating ratios
- On a new sheet ("means"), calculate the mean of each pair of replicated experiments (converting 8 chips to 4 means).
- Use the Excel "AVERAGE" function. Ex: =AVERAGE(norm!B2:C2)
- On a new sheet ("ratios"), calculate the ratios (for both brain and liver) of fetal tissue / adult tissue (converting 4 means to 2 ratios).
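The two steps of Part II can be sketched in Python as follows. This is an illustration under stated assumptions: the column names, function names, and signal values are all made up, only the brain comparison is shown (liver works the same way), and the adult means are assumed to be nonzero.

```python
def replicate_means(rep1, rep2):
    """Per-probeset mean of two replicate chips (like AVERAGE over two cells)."""
    return [(a + b) / 2 for a, b in zip(rep1, rep2)]


def fold_ratios(fetal, adult):
    """Per-probeset ratio of fetal / adult mean signal.

    Assumes every adult value is nonzero.
    """
    return [f / a for f, a in zip(fetal, adult)]


# Hypothetical normalized signals for three probesets on four brain chips.
norm = {
    "fetal_brain_1": [210.0, 48.0, 100.0],
    "fetal_brain_2": [190.0, 52.0, 100.0],
    "adult_brain_1": [95.0, 105.0, 100.0],
    "adult_brain_2": [105.0, 95.0, 100.0],
}

fetal_brain = replicate_means(norm["fetal_brain_1"], norm["fetal_brain_2"])
adult_brain = replicate_means(norm["adult_brain_1"], norm["adult_brain_2"])
brain_ratio = fold_ratios(fetal_brain, adult_brain)
print(brain_ratio)  # [2.0, 0.5, 1.0]
```

In this toy example the first probeset is 2-fold higher in fetal brain, the second is 2-fold lower, and the third is unchanged.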
Part III. Log transformations
- Why use logarithms? After log transformation, up-regulated and down-regulated genes with the same fold change get values of the same magnitude but opposite sign.
Log-transformed chip intensities are recommended for differential expression testing, since these statistical tests assume normal distributions,
which log-transformed intensities follow much more closely than untransformed intensities.
- Logarithms can use any base, but base 2 is easiest when transforming ratios, since transformed 2-fold ratios up or down will be +1 or -1.
As a result, we'll do all logs with base 2 to keep things simple.
- On a new sheet ("logs"), calculate the logs (base 2) of all expression values, means, and ratios.
- Use the Excel "LOG" function. Ex: =LOG(norm!B2,2)
- Convert all the data from sheets "norm", "means", and "ratios."
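The effect of the base 2 log transformation on ratios can be checked with a short Python sketch. The ratios and mean signals below are made up for illustration, and returning None for non-positive values is a choice of this sketch rather than something the exercise prescribes (the logarithm is undefined for those values).

```python
import math


def log2_or_none(value):
    """Base-2 log of a positive value; None where the log is undefined."""
    return math.log2(value) if value > 0 else None


# Hypothetical fetal/adult ratios for five probesets.
ratios = [2.0, 0.5, 1.0, 4.0, 0.25]
log_ratios = [log2_or_none(r) for r in ratios]
print(log_ratios)  # [1.0, -1.0, 0.0, 2.0, -2.0]

# The log of a ratio equals the difference of the logs of its two means,
# so log ratios can also be computed from the log-transformed means.
fetal_mean, adult_mean = 200.0, 50.0
difference = math.log2(fetal_mean) - math.log2(adult_mean)
print(difference, math.log2(fetal_mean / adult_mean))
```

A 2-fold increase and a 2-fold decrease map to +1 and -1, and an unchanged gene maps to 0, which is the symmetry described above.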
WIBR Microarray Analysis Course - 2007