Introduction
You'll be using a sample of expression data from a study using Affymetrix (one-color) arrays that were hybridized to samples from fetal and adult human liver and brain tissue.
Each hybridization was performed in duplicate.
Many other tissues were also profiled but won't be used for these exercises.
What we'll be doing to analyze these data:
- normalize data from these eight chips
- filter out low intensity data
- calculate expression ratios of genes between two different tissues
- use a common statistical test to identify differentially expressed genes
- cluster a differentially expressed subset of all genes to identify those with similar expression profiles
- try to find what functions specific groups of genes (with similar expression profiles) have in common
You'll be using Excel to do most of the mathematical analyses, since this will show the exact formulas used to perform every step of the analysis pipeline.
As a result, you'll need to use Excel functions and be familiar with some Excel conventions. See the Excel help for the details.
Preliminary information: Image analysis and calculation of expression value
- As described in Su et al., 2002,
human tissue samples were hybridized on Affymetrix (one-color) arrays and chips were scanned.
For each tissue, at least two independent samples were hybridized to separate chips.
- Scanned images were quantified (including measurement of background) using standard software.
- Data from a probeset (a series of oligos designed to a specific gene target) were used to calculate an expression value for that probeset
using standard Affymetrix algorithms.
- See the manuals from Affymetrix for more information about these processes,
and the Statistical Algorithms Description Document
for the actual equations used.
- Note that these analysis protocols are generally specific to the chip type and its manufacturer.
Class 1 exercises
Part I. Normalization of expression data
- Why? Chips may have been hybridized to different amounts of RNA, for different amounts of time, with different batches of solutions, etc.
Normalization should make any comparisons between chips more meaningful.
- Download the starting data for the exercises.
- Look at the expression data for a selected series of experiments
- Down the first column are the Affymetrix probeset IDs, each corresponding to a target gene.
- Each other column shows a tissue name, followed by expression values in some arbitrary unit.
- There should be two chips each for adult and fetal brain and liver (a total of 8 chips).
- Note that since this is a one-color array, the expression measures are absolute values and not ratios.
If you're more used to 2-color (cDNA) arrays, this would be similar to (but not exactly the same as) 4 chips, each with one channel/dye for fetal tissue and another for adult tissue.
- Calculate the median of all expression values from each chip.
- Use the Excel "MEDIAN" function to calculate the median at the bottom of each column. Ex: =MEDIAN(B2:B12627)
- To perform global median normalization, on a new sheet ("norm") of the same file, scale each column of data so that each median is 100.
- Divide each expression signal in a chip by the median for that chip and multiply by 100. Ex: =(raw!B2/raw!B$12628)*100 [if B12628 contains the median for the chip]
- Global mean normalization is also possible but more influenced by outliers.
If you want to compare mean and median yourself [after the class], try both methods and then compare the results with scatterplots.
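If you'd like to check your spreadsheet outside Excel, the global median normalization above can be sketched in Python. This is only an illustration: the chip names and values are made-up toy data, and the function name median_normalize is hypothetical, not part of the course files.

```python
from statistics import median

def median_normalize(values, target=100.0):
    """Scale one chip's expression values so that their median equals `target`."""
    chip_median = median(values)
    if chip_median == 0:
        raise ValueError("median is zero; cannot scale this chip")
    # Same arithmetic as the Excel formula: (value / chip median) * 100
    return [v / chip_median * target for v in values]

# Toy values for two hypothetical chips (not data from the exercise file).
raw = {
    "chip_1": [50.0, 200.0, 100.0, 400.0, 25.0],
    "chip_2": [30.0, 120.0, 60.0, 240.0, 15.0],
}

norm = {chip: median_normalize(vals) for chip, vals in raw.items()}

for chip, vals in norm.items():
    print(chip, vals, "median:", median(vals))
```

After scaling, both toy chips have a median of 100, so values on one chip can be compared more directly with values on the other.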
Part II. Filtering of low intensity data
- Why? Since subsequent analysis will examine ratios of expression levels between two tissues,
we don't want to spend our energy looking at expression intensities of genes
that might change from close to zero to a larger intensity still close to zero (or vice versa).
- You have two choices for filtering low intensity data, so do either A or B:
- A. filtering by Affymetrix Absent/Present calls, or
- B. filtering by dropping values similar to background
- A. Using Affymetrix Absent/Present calls
- Affymetrix derives its Absent/Present calls from detection p-values calculated using all probe data for a probeset.
See the Statistical Algorithms Description Document for the details.
- Look at the sheet called "ap", showing Absent/Present calls as determined by Affymetrix.
- On a new sheet ("norm_filt") of the same file, set the level of each "A" probeset to 1.
We won't use 0 because this could cause subsequent problems with division and logarithms.
- Use the Excel "IF" function to show the normalized value if the gene is "Present" or use 1 otherwise. Ex: =IF(ap!B2="P",norm!B2,1)
- The "IF" function takes three arguments: the statement to test, what to do if it's true, what to do if it's false.
- B. Dropping values similar to background [an alternative to method (A)]
- One probably wants to discount any expression values which are similar to the background intensity of the chip.
- "Similar" can be defined as less than two times the standard deviation of the background.
- Standard deviation of the background is not available for these chips, but the standard deviation of the negative control probesets has been calculated to be 20 (for the data with the median adjusted to 100).
- On a new sheet ("norm_filt") of the same file, set to 1 the level of each probeset with an expression value < 40.
We won't use 0 because this could cause subsequent problems with division and logarithms.
- Use the Excel "IF" function to show the normalized value if the level is greater than 40 or use 1 otherwise. Ex: =IF(norm!B2>40,norm!B2,1)
- Using your chosen method, what fraction of genes on each chip are present?
- Copy one column of data and "Paste Special > Values" into another file. Use "Data > Sort", count the rows with values greater than 1, and divide that count by the total number of rows.
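Both filtering methods, and the fraction of probesets that remain, can also be sketched in Python as a cross-check. The values and calls below are toy data, and the function names are hypothetical; the logic simply mirrors the two Excel "IF" formulas above (anything not called "P", or not greater than 40, is set to 1).

```python
FLOOR = 1.0  # placeholder for filtered values; avoids division by zero and log(0) later

def filter_by_calls(values, calls, floor=FLOOR):
    """Method A: keep a value only where the Absent/Present call is "P"."""
    return [v if c == "P" else floor for v, c in zip(values, calls)]

def filter_by_threshold(values, threshold=40.0, floor=FLOOR):
    """Method B: keep a value only if it is greater than the threshold (2 x SD of 20)."""
    return [v if v > threshold else floor for v in values]

def fraction_present(filtered, floor=FLOOR):
    """Fraction of probesets whose value survived filtering (value greater than floor)."""
    return sum(1 for v in filtered if v > floor) / len(filtered)

# Toy normalized values and calls for one hypothetical chip.
norm_values = [250.0, 35.0, 90.0, 12.0, 41.0]
calls = ["P", "A", "P", "A", "A"]

by_calls = filter_by_calls(norm_values, calls)
by_threshold = filter_by_threshold(norm_values)

print("Method A:", by_calls, fraction_present(by_calls))
print("Method B:", by_threshold, fraction_present(by_threshold))
```

In this toy example the two methods disagree on the last probeset (called "A" but with a value of 41), which is a reminder that A and B will not always give the same fraction of present genes.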
Part III. Calculating ratios
- On a new sheet ("means"), calculate the mean of each pair of replicated experiments (converting 8 chips to 4 means).
- Use the Excel "AVERAGE" function. Ex: =AVERAGE(norm_filt!B2:C2)
- On a new sheet ("ratios"), calculate the ratios (for both brain and liver) of fetal tissue / adult tissue (converting 4 experiments to 2 ratios).
Find log base 2 of each of these ratios, so positive log-transformed ratios would show genes with higher expression in fetal tissue than in adult tissue.
- Use the Excel "LOG" function with 2 as the last argument (to show the base). Ex: =LOG(means!B2/means!C2,2)
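The replicate means and log2 ratios can be sketched in Python as well. The four lists below are toy filtered values for one tissue (two fetal replicates and two adult replicates), and the function names are hypothetical. With fetal / adult as the ratio, a log2 value of 2 means four-fold higher expression in fetal tissue, and a value of -2 means four-fold higher expression in adult tissue.

```python
from math import log2

def pair_means(rep1, rep2):
    """Mean of two replicate chips, probeset by probeset."""
    return [(a + b) / 2 for a, b in zip(rep1, rep2)]

def log2_ratios(numerator, denominator):
    """log2(numerator / denominator), probeset by probeset."""
    return [log2(n / d) for n, d in zip(numerator, denominator)]

# Toy filtered values (floor of 1 already applied) for three probesets.
fetal_1 = [200.0, 1.0, 50.0]
fetal_2 = [600.0, 1.0, 150.0]
adult_1 = [100.0, 1.0, 300.0]
adult_2 = [100.0, 1.0, 500.0]

fetal_mean = pair_means(fetal_1, fetal_2)
adult_mean = pair_means(adult_1, adult_2)
ratios = log2_ratios(fetal_mean, adult_mean)  # fetal / adult

print(ratios)
```

The second probeset shows why filtered values were set to 1 rather than 0: a probeset that is absent in both tissues gives a ratio of 1/1 and a log2 ratio of 0, instead of a division error.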
WIBR Microarray Analysis Course 2004