Microarray analysis exercises

WIBR Microarray Analysis Course 2004

Starting Data Processed Data

Introduction

You'll be using a sample of expression data from a study using Affymetrix (one-color) arrays that were hybridized to samples from fetal and adult human liver and brain tissue. Each hybridization was performed in duplicate. Many other tissues were also profiled but won't be used for these exercises.

What we'll be doing to analyze these data:

normalize data from these eight chips
filter out low intensity data
calculate expression ratios of genes between two different tissues
use a common statistical test to identify differentially expressed genes
cluster a differentially expressed subset of all genes to identify those with similar expression profiles
try to find what functions specific groups of genes (with similar expression profiles) have in common

You'll be using Excel to do most of the mathematical analyses, since this will show the exact formulas used to perform every step of the analysis pipeline. As a result, you'll need to use Excel functions and be familiar with some Excel conventions. See the Excel help for the details.

Preliminary information: Image analysis and calculation of expression value

As described in Su et al., 2002, human tissue samples were hybridized on Affymetrix (one-color) arrays and chips were scanned. For each tissue, at least two independent samples were hybridized to separate chips.
Scanned images were quantified (including measurement of background) using standard software.
Data from a probeset (a series of oligos designed to a specific gene target) were used to calculate an expression value for that probeset using standard Affymetrix algorithms.
See the manuals from Affymetrix for more information about these processes, and the Statistical Algorithms Description Document for the actual equations used.
Note that these analysis protocols are generally specific to the chip type and its manufacturer.

Class 1 exercises

Part I. Normalization of expression data

Why? Chips may have been hybridized to different amounts of RNA, for different amounts of time, with different batches of solutions, etc. Normalization should make any comparisons between chips more meaningful.

Download the starting data for the exercises.
Look at the expression data for a selected series of experiments

Down the first column are the Affymetrix probeset IDs, each corresponding to a target gene.
Each other column shows a tissue name, followed by expression values in some arbitrary unit.
There should be two chips each for adult and fetal brain and liver (a total of 8 chips).
Note that since this is a one-color array, the expression measures are absolute values and not ratios. If you're more used to 2-color (cDNA) arrays, this would be similar to (but not exactly the same as) 4 chips, each with one channel/dye for fetal tissue and another for adult tissue.

Calculate the median of all expression values from each chip.

Use the Excel "MEDIAN" function to calculate the median at the bottom of each column. Ex: =MEDIAN(B2:B12627)

To perform global median normalization, on a new sheet ("norm") of the same file, scale each column of data so that each median is 100.

Divide each expression signal in a chip by the median for that chip and multiply by 100. Ex: =(raw!B2/raw!B$12628)*100 [if B12628 contains the median for the chip]
Global mean normalization is also possible but more influenced by outliers. If you want to compare mean and median yourself [after the class], try both methods and then compare the results with scatterplots.
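If you'd like to see the same arithmetic outside Excel, here is a minimal Python sketch of global median normalization for one chip. The function name and the example values are made up for illustration and are not part of the course data.

```python
from statistics import median

def median_normalize(column, target=100.0):
    """Scale one chip's values so that their median equals `target`."""
    chip_median = median(column)
    return [value / chip_median * target for value in column]

# Hypothetical raw values for one chip (not from the course data).
raw_chip = [50.0, 200.0, 400.0, 800.0, 1600.0]
norm_chip = median_normalize(raw_chip)

print(norm_chip)          # each value divided by the chip median (400) and multiplied by 100
print(median(norm_chip))  # the median of the scaled column is now the target
```

Replacing `median` with `statistics.mean` would give global mean normalization, which is one way to try the comparison suggested above.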

Part II. Filtering of low intensity data

Why? Since subsequent analysis will examine ratios of expression levels between two tissues, we don't want to spend our energy looking at expression intensities of genes that might change from close to zero to a larger intensity still close to zero (or vice versa).

You have two choices for filtering low intensity data, so do either A or B:

A. filtering by Affymetrix Absent/Present calls, or
B. filtering by dropping values similar to background

A. Using Affymetrix Absent/Present calls

Affymetrix performs their Absent/Present calls from p-values associated with detection calls using all probe data for a probeset. See the Statistical Algorithms Description Document for the details.
Look at the sheet called "ap", showing Absent/Present calls as determined by Affymetrix.
On a new sheet ("norm_filt") of the same file, set the level of each "A" probeset to 1. We won't use 0 because this could cause subsequent problems with division and logarithms.

Use the Excel "IF" function to show the normalized value if the gene is "Present" or use 1 otherwise. Ex: =IF(ap!B2="P",norm!B2,1)
The "IF" function takes three arguments: the statement to test, what to do if it's true, what to do if it's false.

B. Dropping values similar to background [an alternative to method (A)]

One probably wants to discount any expression values which are similar to the background intensity of the chip.
"Similar" can be defined as less than two times the standard deviation of the background.
Standard deviation of the background is not available for these chips, but the standard deviation of the negative control probesets has been calculated to be 20 (for the data with the median adjusted to 100).
On a new sheet ("norm_filt") of the same file, set to 1 the level of each probeset with an expression value < 40. We won't use 0 because this could cause subsequent problems with division and logarithms.

Use the Excel "IF" function to show the normalized value if the level is greater than 40 or use 1 otherwise. Ex: =IF(norm!B2>40,norm!B2,1)

Using your chosen method, what fraction of genes on each chip are present?

Copy one column of data and "Paste Special > Values" into another file. Use "Data > Sort", count the rows with values greater than 1, and divide that count by the total number of rows.
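The two filtering choices and the "fraction present" calculation can be sketched in Python as follows. The function names and the sample values are hypothetical; the threshold of 40 and the floor of 1 are the values used in the exercise.

```python
def filter_by_call(values, calls, floor=1.0):
    """Method A: keep a value only if its Absent/Present call is "P"."""
    return [v if c == "P" else floor for v, c in zip(values, calls)]

def filter_by_threshold(values, threshold=40.0, floor=1.0):
    """Method B: keep a value only if it is greater than the threshold."""
    return [v if v > threshold else floor for v in values]

def fraction_present(filtered, floor=1.0):
    """Fraction of probesets whose value is still above the floor."""
    return sum(1 for v in filtered if v > floor) / len(filtered)

# Hypothetical normalized values and calls for five probesets on one chip.
norm = [12.5, 50.0, 100.0, 200.0, 35.0]
calls = ["A", "P", "P", "P", "A"]

filtered_a = filter_by_call(norm, calls)
filtered_b = filter_by_threshold(norm)
print(filtered_a, fraction_present(filtered_a))
print(filtered_b, fraction_present(filtered_b))
```

In this toy example both methods floor the same two probesets, but on real data the two filters will generally disagree for some probesets.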

Class 2 exercises

Part III. Calculating ratios

On a new sheet ("means"), calculate the mean of each pair of replicated experiments (converting 8 chips to 4 means).

Use the Excel "AVERAGE" function. Ex: =AVERAGE(norm_filt!B2:norm_filt!C2)

On a new sheet ("ratios"), calculate the ratios (for both brain and liver) of fetal tissue / adult tissue (converting 4 experiments to 2 ratios). Find log base 2 of each of these ratios, so positive log-transformed ratios show genes with higher expression in fetal tissue than in adult tissue.

Use the Excel "LOG" function with 2 as the last argument (to show the base). Ex: =LOG(means!B2/means!C2,2)
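As a rough Python equivalent of the two steps in this part, the sketch below averages a pair of replicates and then takes the log base 2 of a ratio of two means. The function names and numbers are illustrative only.

```python
import math

def replicate_mean(rep1, rep2):
    """Mean of two replicate chips for one probeset (like Excel's AVERAGE)."""
    return (rep1 + rep2) / 2

def log2_ratio(numerator_mean, denominator_mean):
    """Log base 2 of the ratio of two means (like =LOG(x/y,2) in Excel)."""
    return math.log2(numerator_mean / denominator_mean)

# Hypothetical filtered values for one probeset in two tissues.
mean_1 = replicate_mean(380.0, 420.0)   # 400.0
mean_2 = replicate_mean(90.0, 110.0)    # 100.0

print(log2_ratio(mean_1, mean_2))  # positive: the numerator tissue is higher
print(log2_ratio(mean_2, mean_1))  # negative: the numerator tissue is lower
print(log2_ratio(1.0, 1.0))        # both means floored to 1 gives a log ratio of 0
```

The log transformation makes up- and down-regulation symmetric: a 4-fold increase and a 4-fold decrease become +2 and -2.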

Part IV. Identifying differentially expressed genes

Differentially expressed genes can be naively determined by fold changes but more effectively determined by using a statistic such as the t test.

We'll compare the results of these two methods later in Part VI.

Use the t test with each gene to determine if the data on fetal and adult expression are different in the brain and/or liver.

Use the Excel "TTEST" function on the "ttest" sheet. Ex: =TTEST(norm_filt!B2:C2,norm_filt!D2:E2,2,3)
The "TTEST" function takes four arguments: the first array, the second array, the number of tails, the type.
The number of tails is 2, since one tissue can have an expression that is lower or higher than the other.
The type of t test is 3, which refers to two-sample unequal variance.
But this formula generates an error for data where all expression values have been floored to 1 (which is also when both means are 1).
To prevent this error, we want to check if both means are 1 by using the IF and AND statements:

The "AND" statement checks if the series of statements between the parentheses are all true. Ex: =AND(means!B2=1,means!C2=1)
If the "AND" statement is true, print "1"; otherwise do the ttest.
Combine the IF, AND, and TTEST functions: =IF(AND(means!B2=1,means!C2=1),1,TTEST(norm_filt!B2:C2,norm_filt!D2:E2,2,3))
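To see why the unguarded formula fails, here is a minimal Python sketch of the statistic behind a two-sample unequal-variance (Welch) t test. It assumes the standard Welch formulas and stops at the t statistic and degrees of freedom; converting these to a p-value (what TTEST returns) needs the Student t distribution, which is not in the Python standard library. The function name and data are hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(group1, group2):
    """Return (t, df) for a two-sample unequal-variance t test.

    Returns None when both groups have zero variance (for example when
    all values were floored to 1), because the statistic is undefined.
    """
    n1, n2 = len(group1), len(group2)
    se1 = variance(group1) / n1   # squared standard error of group 1
    se2 = variance(group2) / n2   # squared standard error of group 2
    if se1 + se2 == 0:
        return None
    t = (mean(group1) - mean(group2)) / math.sqrt(se1 + se2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df

print(welch_t([10.0, 12.0], [20.0, 22.0]))  # a defined statistic
print(welch_t([1.0, 1.0], [1.0, 1.0]))      # None: the case the IF/AND guard catches
```

With only two replicates per group the degrees of freedom are very small, which is one reason p-values from this design should be read with caution.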

List all the gene IDs for those that meet your significance threshold (such as p < 0.01).

It may be easiest to copy the sheet of t test output "Paste Special > Values" into another sheet ("to_sort") or file before performing "Data > Sort".

Get 2 lists (one for each tissue) of gene IDs that have a p-value < 0.01 and paste them into the "lists" sheet.

Use the Compare two lists tool to get the non-redundant union of these lists.
Paste this combined list into a new sheet (like "selected").
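If the web tool is unavailable, a non-redundant union is easy to sketch in Python. The probeset IDs below are placeholders, not real Affymetrix IDs, and the function name is made up for this example.

```python
def union_preserving_order(*id_lists):
    """Combine several ID lists, keeping each ID once in first-seen order."""
    seen = set()
    combined = []
    for id_list in id_lists:
        for probe_id in id_list:
            if probe_id not in seen:
                seen.add(probe_id)
                combined.append(probe_id)
    return combined

# Placeholder IDs for genes with p < 0.01 in each tissue.
brain_ids = ["probe_1", "probe_2", "probe_3"]
liver_ids = ["probe_3", "probe_4", "probe_1"]

selected = union_preserving_order(brain_ids, liver_ids)
print(selected)
```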

Compile the mean expression values of all genes that show a significant change in expression (to use later for clustering) in the four tissues.

On the "selected" sheet use the Excel "VLOOKUP" function. Ex: =VLOOKUP($A2,means!$A$2:$E$12627,2,FALSE)
The "VLOOKUP" function takes 4 arguments: the value to search for, the table to search (containing the value to search for in the first column), the column number from which the matching value is returned, "FALSE" (to indicate that you want an exact match rather than the closest match).
The "table to search" is the 5 columns (1 column of gene IDs + 4 columns for the 4 tissues) of mean expression values.
Note that the positions of the table to search must be fixed (with a "$" before each column and row).
Note that this command, when copied into lots of cells, can take the computer a while to perform.
Save this sheet as a text file by either one of these methods:

"File > Save As", and for "Save as type", choose "Text (tab delimited)".
Copy the sheet into a text editor and save that file.
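The lookup and the tab-delimited export can also be sketched in Python, with a dictionary playing the role of the "table to search". All IDs, column names, and values below are placeholders. Unlike VLOOKUP, which shows an error for a missing ID, this sketch simply skips IDs that are not in the table.

```python
import csv
import io

# Placeholder table: probeset ID -> mean expression in four tissues.
means = {
    "probe_1": [400.0, 100.0, 50.0, 60.0],
    "probe_2": [1.0, 1.0, 250.0, 30.0],
    "probe_3": [80.0, 90.0, 1.0, 1.0],
}
selected = ["probe_3", "probe_1"]

def lookup_rows(ids, table):
    """Return [ID, value1, ..., value4] for each ID found in the table."""
    return [[probe_id] + table[probe_id] for probe_id in ids if probe_id in table]

# Write tab-delimited text to an in-memory buffer; use open(path, "w", newline="")
# in place of io.StringIO() to write a real file.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["ID", "tissue_1", "tissue_2", "tissue_3", "tissue_4"])
writer.writerows(lookup_rows(selected, means))
text = buffer.getvalue()
print(text)
```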

Part V. Clustering

Use any or all of these data sets. The third dataset, being across more tissues, may be the most interesting.
1. a pre-processed set of expression values (not ratios)
2. a full set of expression ratios (transformed to log base 2), with values compared to the mean across all tissues
Open Cluster 3.0, a clustering application that works on all operating systems. It's an enhanced version of the Eisen clustering program. See the manual for more information about the program.
File > Open and select your file of expression data (one of the files in Part V.1).
Note that there are some filtering and normalization functions on the tabs "Filter Data" and "Adjust Data", but we've already performed these steps.
Try Hierarchical clustering using the default settings.

Go to the "Hierarchical" tab and check "Cluster" under genes.
Click on "Centroid Linkage" (or "Average Linkage") to use a clustering algorithm that is less sensitive to outliers.
When clustering is completed it'll be shown at the bottom of the window.
Cluster 3.0 generates several files during clustering:

The .cdt file (containing the re-ordered expression data) will be read by JavaTreeView.
For hierarchical clustering, .gtr and .atr files describe the structure of the gene and/or array trees.
For k-means clustering, the .kgg file lists the genes in each of the clusters.

Look at the .cdt output file in a text editor:

Note the new column GWEIGHT (for gene weight) and the new row EWEIGHT (for experiment weight)
You may modify these weights for future clustering (to give more weight, for example, to certain arrays).

Click on "Launch JavaTreeView" to open JavaTreeView for visualizing your data as a heatmap.
Open and view your initial (pre-clustered) text file.
Open and view your final (clustered) file (with a .cdt extension).
Try selecting a region of the data to get a more detailed view.
Try Settings > Pixel Settings and adjust the contrast to get the most informative view for your data.
Note that if you used expression values (rather than ratios), you'll only see two colors and those between them.
Try clustering across Genes and Arrays (tissues) to analyze tissue relatedness.

Try k-Means clustering using the default settings.

Follow the same steps as you did with hierarchical clustering above, but after opening the file, go to the k-Means tab, check "Organize genes" and click on Execute.

Optional: While in JavaTreeView, try Export > Export to Postscript and save all or part of your figure. This will produce an image of optimal resolution. Otherwise, you may wish to export to GIF or bitmap (which are easier to handle in Photoshop, but lower resolution).
Optional: Open the heatmap in Illustrator or Photoshop.

Class 3 exercises

Part VI. Graphing all data

You may try any combination of these graphing techniques. Global visualization of an experiment can be helpful for designing subsequent analysis and for quality control.

Intensity scatterplot (sample)

Go to the "means" sheet and select two columns for either brain or liver.
Click on the Chart Wizard and for Chart Type, select "XY(Scatter)".
Click on "Next >" twice and label the chart and axes.
Click on "Finish".
To make the axes logarithmic, click on an axis to get the Format Axis box.

Select the Scale tab and check the "Logarithmic scale" box.
Repeat for the other axis.

R-I (M-A) plot (sample)

A "ratio-intensity plot" looks like a scatterplot that has been rotated 45 degrees.
It's the most common plot associated with lowess normalization.
After choosing the two columns of data you wish to compare, begin by calculating two new values for each gene using the Excel "LOG" function, like =LOG(B2/C2,2) where the last argument is the base.
1. I = log₂(experiment 1 * experiment 2)
2. R = log₂(experiment 1 / experiment 2)
Follow the same instructions as for a scatterplot (above).
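The two values defined above can be computed in Python as in this short sketch. The function name and the example intensities are illustrative only.

```python
import math

def ri_values(experiment_1, experiment_2):
    """Return (I, R) for one gene, using the definitions given above."""
    intensity = math.log2(experiment_1 * experiment_2)
    ratio = math.log2(experiment_1 / experiment_2)
    return intensity, ratio

# Hypothetical mean expression values for three genes in two experiments.
pairs = [(400.0, 100.0), (100.0, 100.0), (50.0, 200.0)]
for x1, x2 in pairs:
    i_value, r_value = ri_values(x1, x2)
    print(f"I = {i_value:.2f}, R = {r_value:.2f}")
```

Plot I on the x-axis and R on the y-axis. Genes with equal expression in both experiments fall on the horizontal line R = 0, whatever their intensity.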

volcano plot (sample)

These plots will help compare two methods for determination of differential expression: fold changes and t tests.
The x-axis is the log₂ ratio between two tissues (from Part III) and the y-axis is the p-value from the t test on the same two tissues (from Part IV).

Start by choosing a tissue and copying the log₂ ratio data and t test data into two adjacent columns.
Replace any non-numerical characters (if present) with 1.
Click on the Chart Wizard and for Chart Type, select "XY(Scatter)", and go through the wizard as before.
Once you have a graph, click on the y-axis to get the Format Axis box.

Select the Scale tab and check the "Logarithmic scale" and the "Values in reverse order" boxes.

Click on the x-axis to get the Format Axis box.

Select the Patterns tab and select "High" for tick mark labels.
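A reversed logarithmic y-axis is equivalent to plotting -log10(p), which is how volcano plots are often drawn outside Excel. The sketch below computes those coordinates and sorts genes by the two criteria being compared. The p < 0.01 threshold comes from Part IV; the 2-fold cutoff (an absolute log₂ ratio of at least 1) is only an assumption for illustration, as are the function names and data.

```python
import math

def volcano_point(log2_ratio, p_value):
    """Return (x, y) for a volcano plot: log2 ratio and -log10(p)."""
    return log2_ratio, -math.log10(p_value)

def classify(log2_ratio, p_value, min_abs_log2=1.0, max_p=0.01):
    """Say which of the two criteria (fold change, t test) a gene passes."""
    by_fold = abs(log2_ratio) >= min_abs_log2
    by_ttest = p_value < max_p
    if by_fold and by_ttest:
        return "both"
    if by_fold:
        return "fold change only"
    if by_ttest:
        return "t test only"
    return "neither"

# Hypothetical (log2 ratio, p-value) pairs for four genes.
genes = [(2.0, 0.001), (3.0, 0.2), (0.3, 0.005), (0.0, 1.0)]
for ratio, p in genes:
    print(volcano_point(ratio, p), classify(ratio, p))
```

Genes in the upper left and upper right corners of the plot are the ones that pass both criteria.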

Part VII. Functional analysis

Annotation (Excel and web tools)

Download the Excel annotation tool for the Affymetrix U95 chip used in this experiment.
This Excel file was created using an annotation file from Affymetrix and a simple use of the "VLOOKUP" function.
Go to the "list" sheet and paste in a list of Affymetrix IDs showing, for example, your list of genes with differential expression in the fetal vs. adult brain.
Use F9 (if required) to calculate the VLOOKUP functions.
Optional: browse the list of genes to find any of your favorites.

Comparing two lists

Are there any genes which are differentially expressed in both brain and liver?
Looking at your selected set of genes (those differentially expressed in brain and/or liver), compare

genes that are expressed at a different level in the fetal liver and the adult liver
genes that are expressed at a different level in the fetal brain and the adult brain

Go to the Compare two lists tool and paste in both lists.
What genes are in the intersection of both lists?
Record the following three numbers:

number of genes differentially expressed (fetal vs. adult) only in the liver
number of genes differentially expressed (fetal vs. adult) only in the brain
number of genes differentially expressed (fetal vs. adult) in both the liver and the brain (the intersection of the two original lists)

Use the Venn diagram generator to draw a figure of these data (or use {55, 74, 98}).
If your browser isn't configured to view SVG graphics, save the SVG file as text and choose a file name ending in .svg
If you want to save the image, right click on it and select "Save SVG As".
The .svg file can then be opened and edited in Illustrator.
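The three numbers needed for the Venn diagram are simple set operations, sketched here in Python with placeholder IDs (not real Affymetrix IDs).

```python
# Placeholder lists of differentially expressed probesets for each tissue.
liver_ids = {"probe_1", "probe_2", "probe_3", "probe_4"}
brain_ids = {"probe_3", "probe_4", "probe_5"}

only_liver = liver_ids - brain_ids   # in the liver list but not the brain list
only_brain = brain_ids - liver_ids   # in the brain list but not the liver list
both = liver_ids & brain_ids         # the intersection of the two lists

print(len(only_liver), len(only_brain), len(both))
print(sorted(both))
```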

Genome mapping of one gene or a set of genes

Go to the annotation file from part VII.1 and select the symbol of an interesting gene from the list.
Go to the UCSC Human Genome Browser Gateway, enter the gene symbol in the "position" field, and click "submit."
On the new page, use a link to a RefSeq Gene (if you hit one) or a Known Gene. Multiple hits at this stage can be the result of multiple transcripts.
The browser can show lots of different types of data and get you to genomic sequence. Ask if you'd like to know more.
To map a set of genes, go to the annotation file from part VII.1 and select the entries in the GeneSymbol column.
Go to a tool like the WIBR human genome mapper and paste your gene symbols (or you can input up to three gene sets at once).
Click on MAP to see if any of your genes map to your favorite parts of the genome or if they appear to be clustered at any particular loci.

Promoter extraction

Since gene expression is regulated in part via the binding of transcription factors to gene promoters, you may want to get some promoter sequence.
Different tools can be used to extract promoter sequence (in addition to using a genome browser from the last step).
Go to the annotation file from part VII.1 and select the RefSeq Transcript ID for some interesting genes.
You may need to search and replace any characters that aren't gene IDs.
Go to the RefSeq promoter extractor and paste a list of RefSeq IDs (NM_...).
Select the length of sequence to define your "promoters" (or use the defaults) and get the genomic sequence.
Save the promoter sequences as a text file (or copy into a text editor) to use for subsequent analyses.

Identifying potential transcription factor binding sites with TRANSFAC

Go to TRANSFAC and use your BaRC username and password for your tak account.
To search for potential transcription factor binding sites in any promoters, click on MATCH on the left column.
Paste your promoter(s) into the big box.
Select "vertebrates" under "Group of matrices".
Note the "Cut-off selection for matrix group" that can be adjusted for the rate of false positives and false negatives.
Click on "Submit the form".
Look at the output: does it make sense?

Gene Ontology analysis

Given a list of genes, how can we figure out what functions the genes have in common, or more precisely, what functions are over-represented in the gene set?
Gene Ontology (GO) annotation provides information to effectively answer this question using one of many available tools.
Go to DAVID and click on "Upload New List" under DAVID tools.
Paste one of your gene lists, select AFFYID, and "Submit Text".
On the next page, click on GOcharts.
Select one (or more) of the three ontologies under CLASSIFICATION TYPE and click on "Chart Values!"
Note that some terms are too general and some too specific to be informative, but those in between should tell something about what's special about the genes in your list (if there is something special).
To find out whether any of these GO terms occur more often in your gene list than you'd expect for a subset of the genes on the Affymetrix U95 chip, select EASEonline under DAVID tools.
Under SELECT BACKGROUND LIST, choose U95A and click on Submit.
On the next page, 4 numbers are associated with each Category:

LH (list hits): number of genes with this GO term in your gene list
LT (list total): number of genes in your gene list mapped to any term in this ontology ("system")
PH (population hits): number of genes with this GO term on the background list (the whole chip)
PT (population total): number of genes on the background list (the whole chip) mapped to any term in this ontology ("system")

The EASE Score is a modified Fisher exact p-value: the smaller the score, the stronger the evidence that this term is over-represented in your gene list.
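To get a feel for how the four numbers combine, here is a sketch of a plain one-sided hypergeometric test in Python. This is not the EASE Score itself, which is a more conservative variant, so the numbers will not match DAVID's output exactly. The function name and the example counts are made up for illustration.

```python
from math import comb

def hypergeom_upper_tail(lh, lt, ph, pt):
    """Probability of seeing at least `lh` list hits by chance.

    lh: list hits, lt: list total, ph: population hits, pt: population total.
    Assumes the gene list is a random draw (without replacement) from the
    background list.
    """
    total = comb(pt, lt)
    upper = min(lt, ph)
    favorable = sum(comb(ph, k) * comb(pt - ph, lt - k) for k in range(lh, upper + 1))
    return favorable / total

# Toy example: 3 of 10 background genes have the term; 2 of the 4 list genes do.
print(hypergeom_upper_tail(2, 4, 3, 10))

# Larger hypothetical example: 20 of 100 list genes vs. 300 of 9000 on the chip.
print(hypergeom_upper_tail(20, 100, 300, 9000))
```

A small probability means the term appears in your list more often than chance alone would explain, given the background you chose.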

Pathway analysis (KEGG)

We can try to map a gene set to known pathways using a database such as KEGG.
While in DAVID (from the last step), select KEGGCharts under DAVID Tools and click on "Chart Pathways!"
Following any of the pathway links takes you to a KEGG pathway, with the proteins from your genes colored red.

Motif finding (Meme)

Since we can get promoters from a set of co-regulated genes, we can try to identify over-represented motifs that may act as transcription factor binding sites.
Go to Meme, a popular tool for this type of analysis, and paste the promoters of several co-expressed genes. Note that Meme takes a lot of computing power, so you may have problems with more than 10 sequences at once.
Enter your email address and click on Start Search.
A response can take hours or days, so don't hold your breath.
See the sample Meme output to get an idea of the output to expect. Note that only some "motifs" may be biologically meaningful.

Comparisons to other expression data

How does this experiment compare to other experiments? What type of expression patterns do your interesting genes show in other experiments?
Go to a repository like GEO.
To look at one gene in detail, enter the gene name or Affymetrix ID next to "Gene profiles" and click on GO.
To look for a type of experiment, enter a term (ex: embryo) next to DataSets and click on GO.

Click on a GDS (Geo Dataset) to go to a new page with a link to download the dataset.
Click on a GSM (Geo Sample) to get more information and expression data from a specific chip/hybridization.
