Microarray analysis exercises 2

WIBR Microarray Analysis Course - May 2005


Class 2 exercises

Part IV. Identifying differentially expressed genes

Differentially expressed genes can be naively determined by fold changes but more effectively determined by using a statistic such as the t test.

We'll compare the results of these two methods later in Part VI.

Use the t test with each gene to determine if the data on fetal and adult expression are different in the brain and/or liver.

Use the Excel "TTEST" function on the "ttest" sheet. Ex: TTEST(logs!B2:C2,logs!D2:E2,2,3)
The "TTEST" function takes four arguments: the first array, the second array, the number of tails, the type.
The number of tails is 2, since one tissue can have an expression that is lower or higher than the other.
The type of t test is 3, which refers to two-sample unequal variance.
It is safer not to assume that each gene exhibits the same variance in each tissue, although we do lose some statistical power.
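For readers who want to check the Excel result outside the spreadsheet, here is a minimal sketch of the same two-tailed, unequal-variance (Welch) t test using only the Python standard library. The function names and the sample values are made up for illustration, and the p-value is obtained by simple numerical integration of the t density, which is adequate for moderate t values but is not how Excel computes it. If SciPy is available, scipy.stats.ttest_ind(a, b, equal_var=False) performs the same test.

```python
import math
from statistics import mean, variance


def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value for a t statistic (Simpson integration of the t density)."""
    t = abs(t)
    if t == 0:
        return 1.0
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x):
        return norm * (1 + x * x / df) ** (-(df + 1) / 2)

    h = t / steps  # steps must be even for Simpson's rule
    total = pdf(0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central_area = 2 * total * h / 3  # P(-t < T < t), using symmetry
    return max(0.0, 1 - central_area)


def welch_ttest(a, b):
    """Return (t, df, p) for a two-tailed, unequal-variance two-sample t test.

    Assumes each group has at least two values and that the groups do not
    both have zero variance.
    """
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df, t_two_sided_p(t, df)


# Hypothetical log2 values for one gene: two fetal and two adult replicates
fetal = [7.1, 7.4]
adult = [9.0, 9.6]
t_stat, dof, p_value = welch_ttest(fetal, adult)
```

With only two replicates per group the degrees of freedom are very low, which is one reason the unequal-variance test costs statistical power.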

Use the "Absent/Present" calls from the Affymetrix algorithm to flag genes with questionable expression levels.

In the process of converting probe data into one probeset measurement, Affymetrix calculates p-values reflecting confidence that the gene is present in the sample, and these are used to classify each probeset as Absent, Present, or Marginal.
We don't want to spend our energy looking at expression intensities of genes that might change from close to zero to a larger intensity still close to zero (or vice versa).
Our current task is to flag all genes that have "Absent" calls in all brain samples or in all liver samples, using the original data in the "ap" sheet.
One way to do this is to merge the calls into one cell and then test if it's "AAAA".
On the "ttest" sheet, use the functions "IF" and "CONCATENATE" to do this. Ex: =IF(CONCATENATE(ap!B2,ap!C2,ap!D2,ap!E2)="AAAA","A","P")
The syntax of IF is: logical test, value if true, value if false
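The flagging rule above can also be written as a short Python function. This is only a sketch of the same logic as the Excel formula; the function name is made up, and the input is assumed to be the list of call letters for one gene in one tissue.

```python
def absent_flag(calls):
    """Return "A" if every call for this tissue is "A", otherwise "P".

    Mirrors =IF(CONCATENATE(...)="AAAA","A","P"): a Marginal ("M") or
    Present ("P") call in any sample makes the gene count as present.
    """
    if not calls:
        raise ValueError("expected at least one call")
    return "A" if all(call == "A" for call in calls) else "P"


# Hypothetical calls for one gene across four chips of one tissue
flag = absent_flag(["A", "A", "M", "A"])
```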

List all the gene IDs for those that meet your significance threshold (such as p < 0.05) and are present in at least one sample.

Note that the actual p-values representing confidence for differential expression are raw values. If they were to be corrected for multiple hypothesis testing (since we're doing lots of t-tests), they'd be much higher. For these exercises, however, we'll use the raw values but treat them as if they have been corrected.
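To see how much a correction would raise the raw p-values, here is a sketch of two common adjustments, Bonferroni and Benjamini-Hochberg. The course does not say which correction it has in mind, so the choice of these two methods is an assumption, and the function names and p-values are illustrative only.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values from four t tests
raw = [0.01, 0.04, 0.03, 0.005]
strict = bonferroni(raw)
fdr = benjamini_hochberg(raw)
```

With thousands of genes on a chip, the multiplier in either method is large, which is why corrected values end up much higher than the raw ones.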

It may be easiest to copy the t test output and use "Paste Special > Values" to paste it into another sheet ("to_sort") or file before performing "Data > Sort".

Get 2 lists (one for each tissue) of gene IDs that have a p-value < 0.05 and have a "P" call based on 4 chips.

You can sort by the two columns for a tissue in either order (or both at the same time, with a formula).
After you select the lines that pass one criterion, sort them by the other criterion.
Paste the two lists into the "lists" sheet.

Use the Compare two lists tool to get the non-redundant union of these lists.
Paste this combined list into a new sheet (like "selected").
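The selection and merging steps above can be sketched in Python as follows. This is not the Compare two lists tool itself, only an illustration of the same idea; the function names, the row layout (gene ID, p-value, A/P flag) and the gene IDs are all hypothetical.

```python
def significant_ids(rows, alpha=0.05):
    """Keep gene IDs with p < alpha and a "P" flag.

    rows: iterable of (gene_id, p_value, flag) tuples for one tissue.
    """
    return [gene_id for gene_id, p, flag in rows if p < alpha and flag == "P"]


def nonredundant_union(first, second):
    """Combine two ID lists, keeping each ID once, in order of first appearance."""
    seen = set()
    merged = []
    for gene_id in list(first) + list(second):
        if gene_id not in seen:
            seen.add(gene_id)
            merged.append(gene_id)
    return merged


# Hypothetical per-tissue results
brain = [("gene_a", 0.01, "P"), ("gene_b", 0.20, "P"), ("gene_c", 0.03, "A")]
liver = [("gene_a", 0.04, "P"), ("gene_d", 0.002, "P")]
selected = nonredundant_union(significant_ids(brain), significant_ids(liver))
```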

Compile all the log-transformed data for all genes that show a significant change in expression (to use later for clustering) in the four tissues.

On the "selected" sheet use the Excel "VLOOKUP" function. Ex: =VLOOKUP($A2,logs!$A$2:$O$12627,2,FALSE)
The "VLOOKUP" function takes 4 arguments: the value to search for, the table to search (containing the value to search for in the first column), the column number from which the matching value is returned, "FALSE" (to indicate that you want an exact match rather than the closest match).
The "table to search" is the 5 columns (1 column of gene IDs + 4 columns for the 4 tissues) of mean expression values.
Note that the positions of the table to search must be fixed (with a "$" before each column and row).
Note that this command, when copied into lots of cells, can take the computer a while to perform.
Save this sheet as a text file by either one of these methods:

"File > Save As", and for "Save as type", choose "Text (tab delimited)".
Copy the sheet into a text editor and save that file.
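As an illustration of what the VLOOKUP step does, here is a minimal Python sketch of an exact-match lookup. The function names and data are made up. Building a dictionary index once, rather than scanning the table for every cell, is one way to avoid the slowness mentioned above.

```python
def build_index(table):
    """Map each ID (first column) to its row, keeping the first match as VLOOKUP does."""
    index = {}
    for row in table:
        index.setdefault(row[0], row)
    return index


def lookup(index, key, col_index):
    """Exact-match lookup like VLOOKUP(key, table, col_index, FALSE).

    col_index is 1-based, as in Excel. Returns None when the key is absent
    (Excel would show #N/A instead).
    """
    row = index.get(key)
    return None if row is None else row[col_index - 1]


# Hypothetical table: gene ID followed by four mean log2 expression values
logs = [
    ["gene_a", 7.2, 9.1, 6.5, 6.6],
    ["gene_b", 5.0, 5.1, 8.8, 9.4],
]
index = build_index(logs)
value = lookup(index, "gene_b", 2)
```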

Optional: The BaRC submatrix selector, an alternative to Excel's VLOOKUP

The BaRC submatrix selector takes as input a data matrix and a subset of IDs from the first column of the matrix. Then it selects only desired rows from the matrix, in the order of the input ID list.
Save the "logs" sheet as a text file.

Go to the "logs" sheet and try "File > Save As" and for "Save as type", choose "Text (tab delimited)".
Excel may complain that you can only save one sheet and that some information may be lost, but just choose a different name (like "Array_logs.txt") and you'll still have your original big file.

Save the list of Affy IDs from the "selected" sheet as a text file (like "Selected_ids.txt").
Go to the BaRC Submatrix Selector.

Browse and select the data file ("Array_logs.txt")
Browse and select the list of row IDs ("Selected_ids.txt") or paste the list into a text editor.
You can use a third file of column IDs, but omitting it will give you all columns, which is what we want here.
Click on "Select Submatrix" and you should get output in 1-2 seconds (if the program works correctly).
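The row selection described above can be sketched in a few lines of Python. This is not the BaRC tool's code, only an illustration of the behavior described: rows are returned in the order of the ID list. It assumes the matrix text has a header row and that IDs missing from the matrix are silently skipped; both are assumptions, and the function name and data are hypothetical.

```python
import csv
import io


def select_submatrix(matrix_text, id_text):
    """Return tab-delimited text with the header plus the requested rows, in ID-list order."""
    reader = csv.reader(io.StringIO(matrix_text), delimiter="\t")
    header = next(reader)
    rows = {}
    for row in reader:
        if row:
            rows.setdefault(row[0], row)  # keep the first row for a repeated ID
    wanted = [line.strip() for line in id_text.splitlines() if line.strip()]
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows[i] for i in wanted if i in rows)
    return out.getvalue()


# Hypothetical contents of "Array_logs.txt" and "Selected_ids.txt"
matrix = "ID\tbrain\tliver\ng1\t1\t2\ng2\t3\t4\ng3\t5\t6\n"
ids = "g3\ng1\n"
submatrix = select_submatrix(matrix, ids)
```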

Part V. Clustering

Use any or all of these data sets. The third dataset, being across more tissues, may be the most interesting.
1. a pre-processed set of log2-transformed expression values (not ratios).
2. a full set of expression ratios (transformed to log base 2), with values compared to the mean across all tissues
Open Cluster 3.0, a clustering application that works on all operating systems. It's an enhanced version of the Eisen clustering program. See the manual for more information about the program.
File > Open and select your file of expression data (one of the files in Part V.1).
Note that there are some filtering and normalization functions on the tabs "Filter Data" and "Adjust Data", but we've already performed these steps.
Try Hierarchical clustering using the default settings.

Go to the "Hierarchical" tab and check "Cluster" under genes.
Click on "Centroid Linkage" (or "Average Linkage") to use a clustering algorithm that is less sensitive to outliers.
When clustering is completed it'll be shown at the bottom of the window.
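To make the idea of average linkage concrete, here is a small pure-Python sketch of agglomerative clustering. It is not Cluster 3.0's implementation: it uses Euclidean distance (an assumption; Cluster 3.0 offers several similarity metrics), it is far too slow for a full chip, and the function names and expression profiles are made up.

```python
import math
from itertools import combinations


def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def average_linkage(profiles):
    """Cluster gene profiles bottom-up.

    profiles: dict mapping gene ID to a list of expression values.
    Returns the merges in order as (members_a, members_b, distance).
    """
    def cluster_distance(a, b):
        # Average linkage: mean of all pairwise distances between the two clusters
        return sum(euclidean(profiles[x], profiles[y]) for x in a for y in b) / (len(a) * len(b))

    clusters = [(gene_id,) for gene_id in profiles]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: cluster_distance(*pair))
        merges.append((a, b, cluster_distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical expression profiles for four genes in two tissues
profiles = {"g1": [0, 0], "g2": [0, 1], "g3": [5, 5], "g4": [5, 6]}
merges = average_linkage(profiles)
```

The sequence of merges is the information that a tree file such as the .gtr file has to record; the sketch returns it as a plain list rather than in that file format.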
Cluster 3.0 generates several files during clustering:

The .cdt file (containing the re-ordered expression data) will be read by JavaTreeView.
For hierarchical clustering, .gtr and .atr files describe the structure of the gene and/or array trees.
For k-means clustering, the .kgg file lists the genes in each of the clusters.

Look at the .cdt output file in a text editor:

Note the new column GWEIGHT (for gene weight) and the new row EWEIGHT (for experiment weight)
You may modify these weights for future clustering (to give more weight, for example, to certain arrays).

Open JavaTreeView for visualizing your data as a heatmap.

Open and view your initial (pre-clustered) text file.
Open and view your final (clustered) file (with a .cdt extension).
Try selecting a region of the data to get a more detailed view.
Try Settings > Pixel Settings and adjust the contrast to get the most informative view for your data.
Note that if you used expression values (rather than ratios), you'll only see two colors and those between them.
Try clustering across Genes and Arrays (tissues) to analyze tissue relatedness.
If you wish to use the web link feature, go to Settings > Url Settings and use the link https://www.affymetrix.com/analysis/netaffx/fullrecord.affx?pk=HG-U95AV2:HEADER (so probeset 32615_at is linked to the Affymetrix NetAffx page https://www.affymetrix.com/analysis/netaffx/fullrecord.affx?pk=HG-U95AV2:32615_at). This Affymetrix site requires free registration but provides a lot of good data.
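The effect of the URL setting can be illustrated with a short Python sketch: the word HEADER in the template is replaced by the probeset ID. How JavaTreeView performs the substitution internally is not covered here, and the function name is made up.

```python
URL_TEMPLATE = "https://www.affymetrix.com/analysis/netaffx/fullrecord.affx?pk=HG-U95AV2:HEADER"


def probeset_url(probeset_id, template=URL_TEMPLATE):
    """Fill the HEADER placeholder in the link template with a probeset ID."""
    return template.replace("HEADER", probeset_id)


link = probeset_url("32615_at")
```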

Try k-Means clustering using the default settings.

Follow the same steps as you did with hierarchical clustering above, but after opening the file, go to the k-Means tab, check "Organize genes" and click on Execute.
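For comparison with hierarchical clustering, here is a minimal sketch of the k-means idea in Python: assign each gene to its nearest cluster center, recompute the centers, and repeat until nothing changes. It is not Cluster 3.0's implementation. Using the first k points as starting centers and squared Euclidean distance are assumptions made to keep the example short and repeatable, and the function name and data are hypothetical.

```python
def kmeans(points, k, iterations=100):
    """Return (assignment, centers) for a list of numeric profiles."""
    centers = [list(p) for p in points[:k]]  # assumption: first k points as starting centers
    assignment = None
    for _ in range(iterations):
        new_assignment = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no gene changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers


# Hypothetical expression profiles for four genes in two tissues
points = [[0, 0], [0, 1], [5, 5], [5, 6]]
assignment, centers = kmeans(points, k=2)
```

The assignment list plays the same role as the .kgg file: it records which cluster each gene ended up in.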

Optional: While in JavaTreeView, try Export > Export to Postscript and save all or part of your figure. This will produce an image of optimal resolution. Otherwise, you may wish to export to GIF or bitmap (which are easier to handle in Photoshop, but lower resolution).
Optional: Open the heatmap in Illustrator or Photoshop.
