Microarray analysis exercises 2
Class 2 exercises
Part IV. Identifying differentially expressed genes
- Differentially expressed genes can be naively determined by fold changes but more effectively determined by using a statistic such as the t test.
- We'll compare the results of these two methods later in Part VI.
- Use the t test with each gene to determine if the data on fetal and adult expression are different in the brain and/or liver.
- Use the Excel "TTEST" function on the "ttest" sheet. Ex: TTEST(norm_filt!B2:C2,norm_filt!D2:E2,2,3)
- The "TTEST" function takes four arguments: the first array, the second array, the number of tails, the type.
- The number of tails is 2, since one tissue can have an expression that is lower or higher than the other.
- The type of t test is 3, which refers to two-sample unequal variance.
- But this formula returns an error for genes where all expression values have been floored to 1 (in which case both means are also 1), because both variances are then 0.
- To prevent this error, check whether both means are 1 by using the IF and AND functions:
- The "AND" function checks whether all of the conditions between the parentheses are true. Ex: =AND(means!B2=1,means!C2=1)
- If the "AND" function returns TRUE, output 1 as the p-value; otherwise, perform the t test.
- Combine the IF, AND, and TTEST functions: =IF(AND(means!B2=1,means!C2=1),1,TTEST(norm_filt!B2:C2,norm_filt!D2:E2,2,3))
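To see why the guard is needed, here is a minimal Python sketch of the statistic behind a two-sample unequal-variance t test. It is an illustration, not Excel's implementation: the function name welch_t is made up for this example, and it stops at the t statistic and degrees of freedom rather than computing the p-value that TTEST returns.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Return (t, degrees_of_freedom) for a two-sample unequal-variance t test,
    or None when both groups have zero variance and the statistic is undefined."""
    # Squared standard error of each group mean.
    se2_a = variance(group_a) / len(group_a)
    se2_b = variance(group_b) / len(group_b)
    if se2_a + se2_b == 0:
        # Same situation the IF/AND guard catches: dividing by zero.
        return None
    t = (mean(group_a) - mean(group_b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(group_a) - 1) + se2_b ** 2 / (len(group_b) - 1)
    )
    return t, df

floored = welch_t([1, 1], [1, 1])   # all values floored to 1
normal = welch_t([1, 2], [3, 4])    # made-up expression values
```

With all values floored to 1 the function returns None, which is the case where the spreadsheet formula substitutes a p-value of 1. The second call uses made-up values and returns a t statistic of about -2.83 with 2 degrees of freedom.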
- List all the gene IDs for those that meet your significance threshold (such as p < 0.01).
- It may be easiest to copy the t test output and use "Paste Special > Values" to paste it into another sheet ("to_sort") or file before performing "Data > Sort".
- Get 2 lists (one for each tissue) of gene IDs that have a p-value < 0.01 and paste them into the "lists" sheet.
- Use the Compare two lists tool to get the non-redundant union of these lists.
- Paste this combined list into a new sheet (like "selected").
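The non-redundant union of the two lists can also be sketched in a few lines of Python. This is only an illustration of the operation, not the Compare two lists tool itself; the function name and the gene IDs other than 32615_at are placeholders.

```python
def union_in_order(*gene_lists):
    """Combine several lists of gene IDs, keeping each ID once,
    in the order it is first seen."""
    seen = set()
    combined = []
    for genes in gene_lists:
        for gene_id in genes:
            if gene_id not in seen:
                seen.add(gene_id)
                combined.append(gene_id)
    return combined

# Placeholder IDs standing in for the brain and liver lists.
brain_ids = ["32615_at", "1000_at"]
liver_ids = ["1000_at", "1001_at"]
selected = union_in_order(brain_ids, liver_ids)
```

Here selected holds three IDs, because 1000_at appears in both lists but is kept only once.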
- Compile the mean expression values in the four tissues for all genes that show a significant change in expression (to use later for clustering).
- On the "selected" sheet use the Excel "VLOOKUP" function. Ex: =VLOOKUP($A2,means!$A$2:$E$12627,2,FALSE)
- The "VLOOKUP" function takes 4 arguments: the value to search for, the table to search (containing the value to search for in the first column),
the column number from which the matching value is returned, "FALSE" (to indicate that you want an exact match rather than the closest match).
- The "table to search" is the 5 columns (1 column of gene IDs + 4 columns for the 4 tissues) of mean expression values.
- Note that the positions of the table to search must be fixed (with a "$" before each column and row).
- Note that this command, when copied into lots of cells, can take the computer a while to perform.
- Save this sheet as a text file by either one of these methods:
- "File > Save As", and for "Save as type", choose "Text (tab delimited)".
- Copy the sheet into a text editor and save that file.
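The lookup and export steps above can be sketched outside Excel as follows. This is a rough Python equivalent under stated assumptions: the function names, the column labels, and the example rows are all invented for illustration, and a dictionary keyed on gene ID stands in for the exact-match VLOOKUP.

```python
import csv
import io

def lookup_means(selected_ids, means_rows):
    """Exact-match lookup: return the row of means for each selected gene ID."""
    # Index the table by its first column, as VLOOKUP searches the first column.
    table = {row[0]: row[1:] for row in means_rows}
    # A missing ID raises KeyError, much as VLOOKUP would report #N/A.
    return [[gene_id] + table[gene_id] for gene_id in selected_ids]

def to_tab_delimited(header, rows):
    """Format rows as tab-delimited text, one line per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()

# Invented example: gene ID plus one mean per tissue (column order is assumed).
header = ["ID", "fetal_brain", "adult_brain", "fetal_liver", "adult_liver"]
means_rows = [
    ["geneA", 10.0, 20.0, 30.0, 40.0],
    ["geneB", 1.0, 1.0, 50.0, 60.0],
]
text = to_tab_delimited(header, lookup_means(["geneB"], means_rows))
```

Building the dictionary once makes each lookup fast, which avoids the slowdown noted above when the same search is repeated in many cells.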
Part V. Clustering
- Use any or all of these data sets. The third dataset, being across more tissues, may be the most interesting.
- your subset of expression values (from Part IV.5)
- a pre-processed set of expression values (not ratios)
- a full set of expression ratios (transformed to log base 2), with values compared to the mean across all tissues
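As a rough illustration of what the ratio data set contains, the sketch below converts one gene's expression values into log base 2 ratios. It assumes each value is divided by the arithmetic mean of that gene across tissues before taking the log; the data set itself may have been prepared differently, and the function name and numbers are invented.

```python
from math import log2

def log2_ratios(values):
    """Return log2(value / mean of all values) for one gene across tissues."""
    average = sum(values) / len(values)
    return [log2(v / average) for v in values]

# Invented expression values for one gene in four tissues (mean is 8).
ratios = log2_ratios([4, 4, 16, 8])
```

Values below the gene's mean become negative and values above it become positive, so the ratios are centered around 0.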
- Open Cluster 3.0, a clustering application that works on all operating systems.
It's an enhanced version of the Eisen clustering program. See the manual
for more information about the program.
- File > Open and select your file of expression data (one of the files in Part V.1).
- Note that there are some filtering and normalization functions on the tabs "Filter Data" and "Adjust Data", but we've already performed these steps.
- Try Hierarchical clustering using the default settings.
- Go to the "Hierarchical" tab and check "Cluster" under Genes.
- Click on "Centroid Linkage" (or "Average Linkage") to use a clustering algorithm that is less sensitive to outliers.
- When clustering is complete, this is indicated at the bottom of the window.
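For intuition about what average linkage does, here is a minimal Python sketch of agglomerative clustering. It is not Cluster 3.0's implementation: it assumes Euclidean distance (the program offers several similarity measures), the function name and data are invented, and it recomputes distances naively, so it is only suitable for tiny examples.

```python
from itertools import combinations
from math import dist

def average_linkage(profiles):
    """Repeatedly merge the two closest clusters.
    Returns the merges as (members_a, members_b, distance) tuples."""
    def avg_dist(a, b):
        # Average linkage: mean of all pairwise distances between members.
        total = sum(dist(profiles[i], profiles[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    clusters = [(i,) for i in range(len(profiles))]  # one cluster per gene
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: avg_dist(*pair))
        merges.append((a, b, avg_dist(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Invented profiles: genes 0 and 1 are similar, as are genes 2 and 3.
merges = average_linkage([[0, 0], [0, 1], [10, 10], [10, 11]])
```

The sequence of merges is the information a tree file such as .gtr records: which items were joined and at what distance. Because each merge uses the mean over all member pairs, a single extreme profile has less influence than it would if only the closest or farthest pair were used.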
- Cluster 3.0 generates several files during clustering:
- The .cdt file (containing the re-ordered expression data) will be read by JavaTreeView.
- For hierarchical clustering, .gtr and .atr files describe the structure of the gene and/or array trees.
- For k-means clustering, the .kgg file lists the genes in each of the clusters.
- Look at the .cdt output file in a text editor:
- Note the new column GWEIGHT (for gene weight) and the new row EWEIGHT (for experiment weight)
- You may modify these weights for future clustering (to give more weight, for example, to certain arrays).
- Open JavaTreeView for visualizing your data as a heatmap.
- Open and view your initial (pre-clustered) text file.
- Open and view your final (clustered) file (with a .cdt extension).
- Try selecting a region of the data to get a more detailed view.
- Try Settings > Pixel Settings and adjust the contrast to get the most informative view for your data.
- Note that if you used expression values (rather than ratios), you'll see only two colors and the shades between them.
- Try clustering across Genes and Arrays (tissues) to analyze tissue relatedness.
- If you wish to use the web link feature, go to Settings > Url Settings and use the link https://www.affymetrix.com/analysis/netaffx/fullrecord.affx?pk=HG-U95AV2:HEADER
(so probeset 32615_at is linked to the Affymetrix NetAffx page https://www.affymetrix.com/analysis/netaffx/fullrecord.affx?pk=HG-U95AV2:32615_at).
This Affymetrix site requires free registration but provides a lot of good data.
- Try k-Means clustering using the default settings.
- Follow the same steps as you did with hierarchical clustering above, but after opening the file, go to the k-Means tab, check "Organize genes" and click on Execute.
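The idea behind k-means can be sketched in a few lines of Python. This is a simplified illustration, not what Cluster 3.0 does internally: it assumes Euclidean distance and seeds the cluster centers with the first k profiles, whereas real implementations typically start from random assignments and repeat the run several times. The function name and data are invented.

```python
from math import dist

def k_means(points, k, iterations=100):
    """Assign each point to one of k clusters; returns a list of cluster indices."""
    centers = [list(p) for p in points[:k]]  # naive seeding (assumption)
    assignment = None
    for _ in range(iterations):
        # Step 1: assign every point to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # assignments stopped changing
        assignment = new_assignment
        # Step 2: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Invented profiles forming two obvious groups.
labels = k_means([[0, 0], [10, 10], [0, 1], [10, 11]], k=2)
```

The returned list plays the role of the .kgg file: it records which cluster each gene ended up in. Unlike hierarchical clustering, the number of clusters k must be chosen in advance.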
- Optional: While in JavaTreeView, try Export > Export to Postscript and save all or part of your figure. This will produce an image of optimal resolution.
Otherwise, you may wish to export to GIF or bitmap (which are easier to handle in Photoshop, but lower resolution).
- Optional: Open the heatmap in Illustrator or Photoshop.
WIBR Microarray Analysis Course 2004