Microarray analysis exercises 2

WIBR Microarray Analysis Course - 2006


Class 2 exercises

Part IV. Identifying differentially expressed genes

Differentially expressed genes can be naively determined by fold changes but more effectively determined by using a statistic such as the t test.

We'll compare the results of these two methods later in Part VI.

Use the t test with each gene to determine if the data on fetal and adult expression are different in the brain and/or liver.

Use the Excel "TTEST" function on the "ttest" sheet. Ex: =TTEST(logs!B2:C2,logs!D2:E2,2,3)
The "TTEST" function takes four arguments: the first array, the second array, the number of tails, the type.
The number of tails is 2, since one tissue can have an expression that is lower or higher than the other.
The type of t test is 3, which refers to two-sample unequal variance.
We're safer not assuming that each gene exhibits the same variance in each tissue, but we do lose some statistical power.
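To see what TTEST is doing with these arguments, the sketch below reproduces a two-tailed, unequal-variance (Welch) t test in Python using only the standard library. The function names and the sample log2 values are made up for illustration, and the p-value is approximated by numerically integrating the t density, so small differences from Excel's output are possible.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return (t, df) for a two-sample unequal-variance (Welch) t test.
    Fails with a division by zero if both groups have zero variance."""
    va = variance(a) / len(a)   # squared standard error of group a
    vb = variance(b) / len(b)   # squared standard error of group b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=20000):
    """Two-tailed p-value: 1 minus the area between -|t| and |t|.
    The area is computed with Simpson's rule (steps must be even)."""
    t = abs(t)
    if t == 0:
        return 1.0
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half_area = total * h / 3.0          # area from 0 to |t|
    return max(0.0, 1.0 - 2.0 * half_area)

# Hypothetical log2 values for one gene: two fetal and two adult replicates
fetal, adult = [7.0, 7.2], [9.0, 9.4]
t, df = welch_t(fetal, adult)
print(t, df, two_tailed_p(t, df))
```

With only two replicates per group, the degrees of freedom fall between 1 and 2, which is one way to see how little statistical power this design has.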

Use the "Absent/Present" calls from the Affymetrix algorithm to flag genes with questionable expression levels.

In the process of converting probe data into one probeset measurement, Affymetrix calculates p-values reflecting confidence that the gene is present in the sample, and these are used to classify each probeset as Absent, Present, or Marginal.
We don't want to spend our energy looking at expression intensities of genes that might change from close to zero to a larger intensity still close to zero (or vice versa).
Our current task is to flag all genes that have "Absent" calls in all brain samples or in all liver samples, using the original data in the "ap" sheet.
One way to do this is to merge the calls into one cell and then test if it's "AAAA".
On the "ttest" sheet, use the formulas "IF" and "CONCATENATE" to do this. ex: =IF(CONCATENATE(ap!B2,ap!C2,ap!D2,ap!E2)="AAAA","A","P")
The syntax of IF is: logical test, value if true, value if false
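The same merge-and-test logic can be sketched in Python. The function name is hypothetical, and the comparison is generalized to any number of calls rather than exactly four.

```python
def summarize_calls(calls):
    """Return "A" if every call is Absent, otherwise "P".
    Mirrors =IF(CONCATENATE(...)="AAAA","A","P") from the exercise."""
    merged = "".join(calls)                      # same idea as CONCATENATE
    return "A" if merged == "A" * len(calls) else "P"

# A single Present or Marginal call is enough to avoid the "A" flag
print(summarize_calls(["A", "A", "A", "A"]))
print(summarize_calls(["A", "M", "A", "A"]))
```

As in the Excel formula, a Marginal ("M") call counts as "not all Absent", so the probeset is summarized as "P".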

Sort data and remove non-expressed probesets.

It may be easiest to start by copying the sheet of t test output and using "Paste Special > Values" to paste it into another sheet ("to_sort") or file before performing "Data > Sort".
To filter out uninformative data, delete all rows (probesets and associated data) for probesets which are expressed in no brain or liver hybridizations. This will also help reduce the corrections necessary for multiple hypothesis testing.
- One way to do this is to sort the "to_sort" sheet by both columns with A/P call summaries
- Delete all rows of probesets with "A" calls in both brain and liver.
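As a rough equivalent of the sort-and-delete step, this sketch keeps only rows that are not "A" in both tissues. The dictionary keys (`brain_call`, `liver_call`) and the probeset IDs are invented for the example.

```python
def drop_unexpressed(rows):
    """Remove probesets summarized as Absent in both brain and liver."""
    return [r for r in rows
            if not (r["brain_call"] == "A" and r["liver_call"] == "A")]

rows = [
    {"id": "probe_1", "brain_call": "P", "liver_call": "A"},
    {"id": "probe_2", "brain_call": "A", "liver_call": "A"},  # dropped
    {"id": "probe_3", "brain_call": "A", "liver_call": "P"},
]
kept = drop_unexpressed(rows)
print([r["id"] for r in kept])
```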

(Optional) Correct t-test p-values for multiple hypothesis testing by calculating the False Discovery Rate (FDR)

Note that the actual p-values representing confidence for differential expression are raw values. If they were to be corrected for multiple hypothesis testing (since we're doing lots of t-tests), they'd be much higher.
We can start to calculate FDR for the brain data:

Start by sorting all rows in increasing order of t-test p-values for the brain (adult vs. fetal) analysis.
Note the number of probesets we're analyzing
Copy the last raw p-value into the column of FDR p-values.
For the fields above the last, apply the formula
fdr = min(raw * (num_rows/rank_this_probeset), fdr_for_gene_one_row_below)
ex: =MIN(B7923 * 7923/RANK(B7923,B$2:B7924,1),F7924)
Note that the rank is calculated for the list of raw p-values in ascending order.
Paste this formula in the cells above to correct for all probesets.
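The spreadsheet steps above can be sketched in Python as follows. This is the calculation usually called the Benjamini-Hochberg procedure; the function name is our own, and tied p-values are ranked by their position in the sorted list, which may differ slightly from Excel's RANK (which gives ties the same rank).

```python
def fdr_adjust(pvalues):
    """Return FDR-adjusted p-values in the same order as the input.
    Walks from the largest raw p-value to the smallest, applying
    fdr = min(raw * n / rank, fdr of the next-larger p-value)."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices, ascending p
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):        # bottom row first, like the sheet
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up raw p-values for four probesets
print(fdr_adjust([0.01, 0.04, 0.03, 0.005]))
```

Because the function sorts internally and returns values in the input order, there is no need to re-sort the data to adjust a second column of p-values.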

If you wish to do the FDR calculation for liver data, you'll need to re-sort the rows by liver p-values (and you'll lose the brain FDR values, unless you first select the column and "Paste Special > Values" onto the same column).
For this dataset with only two replicates, you'll find that no p-values corrected by FDR come close to a usual p-value threshold for significance.
What we really need is more replicates. If that's not possible (as with these sample data), we can use raw p-values, but we should be aware that we're going to have a high rate of false positives (i.e., genes that we call differentially expressed but that really aren't).

List all the gene IDs for those that meet your significance threshold (such as raw p < 0.01) and are present in at least one sample.

Get 2 lists (one for each tissue) of gene IDs that have a p-value < 0.01 and have a "P" call based on 4 chips.

You can sort by the two columns for a tissue in either order (or both at the same time, with a formula).
After you select the lines that pass one criterion, sort them by the other criterion.
Paste the two lists into the "lists" sheet.

Use the Compare two lists tool to get the non-redundant union of these lists.
Paste this combined list into a new sheet (like "selected").
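For comparison, the selection and the non-redundant union could be sketched like this. The row keys (`brain_p`, `brain_call`, and so on) and the probeset IDs are invented, and this is not how the Compare two lists tool is implemented.

```python
def passing_ids(rows, p_key, call_key, threshold=0.01):
    """IDs with a raw p-value below the threshold and a "P" call summary."""
    return [r["id"] for r in rows
            if r[p_key] < threshold and r[call_key] == "P"]

def union_in_order(*lists):
    """Non-redundant union of several ID lists, keeping first-seen order."""
    seen, merged = set(), []
    for ids in lists:
        for gene_id in ids:
            if gene_id not in seen:
                seen.add(gene_id)
                merged.append(gene_id)
    return merged

rows = [
    {"id": "probe_1", "brain_p": 0.002, "brain_call": "P", "liver_p": 0.30, "liver_call": "P"},
    {"id": "probe_2", "brain_p": 0.004, "brain_call": "A", "liver_p": 0.001, "liver_call": "P"},
    {"id": "probe_3", "brain_p": 0.008, "brain_call": "P", "liver_p": 0.005, "liver_call": "P"},
]
brain = passing_ids(rows, "brain_p", "brain_call")
liver = passing_ids(rows, "liver_p", "liver_call")
print(union_in_order(brain, liver))
```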

Compile all the log-transformed data for all genes that show a significant change in expression (to use later for clustering) in the four tissues.

On the "selected" sheet use the Excel "VLOOKUP" function. Ex: =VLOOKUP($A2,logs!$A$2:$O$12627,10,FALSE)
The "VLOOKUP" function takes 4 arguments: the value to search for, the table to search (containing the value to search for in the first column), the column number from which the matching value is returned, "FALSE" (to indicate that you want an exact match rather than the closest match).
The "table to search" is the 5 columns (1 column of gene IDs + 4 columns for the 4 tissues) of mean expression values.
Note that the positions of the table to search must be fixed (with a "$" before each column and row).
Note that this command, when copied into lots of cells, can take the computer a while to perform.
Save this sheet as a text file by either one of these methods:

"File > Save As", and for "Save as type", choose "Text (tab delimited)".
Copy the sheet into a text editor and save that file.
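To make the VLOOKUP arguments concrete, here is a small Python sketch of an exact-match lookup followed by tab-delimited output. The `vlookup` function, the table contents, and the IDs are illustrative only; a missing ID returns `None` where Excel would show #N/A.

```python
def vlookup(key, table, col_index):
    """Exact-match lookup, like VLOOKUP(key, table, col_index, FALSE).
    col_index is 1-based; column 1 holds the IDs being searched."""
    for row in table:
        if row[0] == key:
            return row[col_index - 1]
    return None   # Excel would report #N/A here

# Hypothetical "logs" table: ID plus mean log2 values for 4 tissues
logs = [
    ["probe_1", 7.1, 9.2, 6.5, 6.6],
    ["probe_2", 5.0, 5.1, 8.3, 8.0],
    ["probe_3", 10.4, 6.2, 6.1, 9.9],
]
selected = ["probe_3", "probe_1"]

# One tab-delimited line per selected ID, pulling columns 2 through 5
lines = ["\t".join([pid] + [str(vlookup(pid, logs, c)) for c in range(2, 6)])
         for pid in selected]
print("\n".join(lines))
```

Each lookup scans the table from the top, which gives a feel for why filling thousands of cells with VLOOKUP can take a while.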

Optional: The BaRC submatrix selector, an alternative to Excel's VLOOKUP

The BaRC submatrix selector takes as input a data matrix and a subset of IDs from the first column of the matrix. Then it selects only desired rows from the matrix, in the order of the input ID list.
Save the "logs" sheet as a text file.

Go to the "logs" sheet and try "File > Save As" and for "Save as type", choose "Text (tab delimited)".
Excel may complain that you can only save one sheet and that some information may be lost, but just choose a different name (like "Array_logs.txt") and you'll still have your original big file.

Save the list of Affy IDs from the "selected" sheet as a text file (like "Selected_ids.txt").
Go to the BaRC Submatrix Selector.

Browse and select the data file ("Array_logs.txt")
Browse and select the list of row IDs ("Selected_ids.txt") or paste the list into a text editor.
You can use a third file of column IDs but ignoring that will give you all columns, which is what we want here.
Click on "Select Submatrix" and you should get output in 1-2 seconds (if the program works correctly).
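The row selection described above can be approximated in a few lines of Python. This is only a sketch of the idea, not the BaRC tool's code: it assumes the first line of the matrix is a header row and silently skips IDs that are not found, and the actual tool may behave differently in both respects.

```python
import csv
import io

def select_submatrix(matrix_text, wanted_ids):
    """Return the header plus the rows whose first column is in wanted_ids,
    in the order of wanted_ids. Input and output are tab-delimited text."""
    rows = list(csv.reader(io.StringIO(matrix_text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    by_id = {row[0]: row for row in body}          # ID -> full row
    selected = [by_id[i] for i in wanted_ids if i in by_id]
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows([header] + selected)
    return out.getvalue()

# Tiny made-up matrix standing in for "Array_logs.txt"
matrix_text = "ID\tbrain\tliver\np1\t1.0\t2.0\np2\t3.0\t4.0\np3\t5.0\t6.0\n"
print(select_submatrix(matrix_text, ["p3", "p1"]))
```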

Part V. Clustering

Use any or all of these data sets. The third dataset, being across more tissues, may be the most interesting.
1. a pre-processed set of log2-transformed expression values (not ratios).
2. a full set of expression ratios (transformed to log base 2), with values compared to the mean across all tissues
Open Cluster 3.0, a clustering application that works on all operating systems. It's an enhanced version of the Eisen clustering program. See the manual for more information about the program.
File > Open and select your file of expression data (one of the files in Part V.1).
Note that there are some filtering and normalization functions on the tabs "Filter Data" and "Adjust Data", but we've already performed these steps.
Try Hierarchical clustering using the default settings.

Go to the "Hierarchical" tab and check "Cluster" under genes
Click on "Centroid Linkage" (or "Average Linkage") to use a clustering algorithm that is less sensitive to outliers.
When clustering is completed it'll be shown at the bottom of the window.
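To give a feel for what average linkage does, here is a minimal Python sketch of agglomerative clustering. It is not Cluster 3.0's implementation: the distance used (1 minus the Pearson correlation) is our assumption, the function names and gene profiles are invented, and the naive search is far too slow for thousands of genes.

```python
import math

def pearson_distance(x, y):
    """1 - Pearson correlation; fails for a profile with zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1.0 - sxy / math.sqrt(sxx * syy)

def average_linkage(profiles):
    """profiles: dict of gene ID -> list of values.
    Returns the merges as (members_a, members_b, distance) tuples."""
    clusters = {i: [name] for i, name in enumerate(profiles)}
    next_id = len(clusters)
    merges = []
    while len(clusters) > 1:
        best = None
        keys = list(clusters)
        for i, a in enumerate(keys):
            for b in keys[i + 1:]:
                # average of all pairwise distances between the two clusters
                pairs = [(p, q) for p in clusters[a] for q in clusters[b]]
                d = sum(pearson_distance(profiles[p], profiles[q])
                        for p, q in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges

profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],   # rises with g1
    "g3": [4.0, 3.0, 2.0, 1.0],   # opposite pattern
}
for left, right, dist in average_linkage(profiles):
    print(left, right, round(dist, 3))
```

The list of merges plays the role of the tree: each entry records which two groups were joined and how far apart they were.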
Cluster 3.0 generates several files during clustering:

The .cdt file (containing the re-ordered expression data) will be read by JavaTreeView.
For hierarchical clustering, .gtr and .atr files describe the structure of the gene and/or array trees.
For k-means clustering, the .kgg file lists the genes in each of the clusters.

Look at the .cdt output file in a text editor:

Note the new column GWEIGHT (for gene weight) and the new row EWEIGHT (for experiment weight)
You may modify these weights for future clustering (to give more weight, for example, to certain arrays).

Open JavaTreeView for visualizing your data as a heatmap.

Open and view your initial (pre-clustered) text file.
Open and view your final (clustered) file (with a .cdt extension).
Try selecting a region of the data to get a more detailed view.
Try Settings > Pixel Settings and adjust the contrast to get the most informative view for your data.
Note that if you used expression values (rather than ratios), you'll only see two colors and those between them.
Try clustering across Genes and Arrays (tissues) to analyze tissue relatedness.
If you wish to use the web link feature, go to Settings > Url Settings and use the link https://www.affymetrix.com/analysis/netaffx/fullrecord.affx?pk=HG-U95AV2:HEADER (so probeset 32615_at is linked to the Affymetrix NetAffx page https://www.affymetrix.com/analysis/netaffx/fullrecord.affx?pk=HG-U95AV2:32615_at). This Affymetrix site requires free registration but provides a lot of good data.
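The link template works by substituting each probeset ID for the word HEADER. A short Python sketch of that substitution (the function name is ours):

```python
URL_TEMPLATE = ("https://www.affymetrix.com/analysis/netaffx/"
                "fullrecord.affx?pk=HG-U95AV2:HEADER")

def probeset_url(probeset_id, template=URL_TEMPLATE):
    """Replace the HEADER placeholder with a probeset ID."""
    return template.replace("HEADER", probeset_id)

print(probeset_url("32615_at"))
```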

Try k-Means clustering using the default settings.

Follow the same steps as you did with hierarchical clustering above, but after opening the file, go to the k-Means tab, check "Organize genes" and click on Execute.
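For intuition about what k-means is doing, here is a bare-bones Python sketch. It assumes Euclidean distance and uses the first k points as starting centers so that the result is repeatable; Cluster 3.0's own distance options and starting strategy may well differ, and the function name and data are made up.

```python
def kmeans(points, k, iterations=100):
    """Assign each point to the nearest of k centers, recompute the
    centers as cluster means, and repeat until assignments stop changing."""
    centers = [list(p) for p in points[:k]]   # assumption: first k points as seeds
    assignment = None
    for _ in range(iterations):
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break                              # converged
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                        # leave an empty cluster's center alone
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

# Four made-up two-array expression profiles forming two obvious groups
points = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]]
assignment, centers = kmeans(points, k=2)
print(assignment, centers)
```

Unlike hierarchical clustering, the number of clusters k has to be chosen in advance, and different starting centers can give different groupings.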

Optional: While in JavaTreeView, try Export > Export to Postscript and save all or part of your figure. This will produce an image of optimal resolution. Otherwise, you may wish to export to GIF or bitmap (which are easier to handle in Photoshop, but lower resolution).
Optional: Open the heatmap in Illustrator or Photoshop.
