## Microarray analysis exercises 1 - with R

#### Introduction

You'll be using a sample of expression data from a study using Affymetrix (one color) U95A arrays that were hybridized to tissues from fetal and human liver and brain tissue. Each hybridization was performed in duplicate. Many other tissues were also profiled but won't be used for these exercises.

What we'll be doing to analyze these data:

• preprocess the raw data (summarize probe measurements into one measurement for each probe)
• normalize data from these eight chips
• calculate Absent/Present calls which attempts to label genes that are "expressed"
• calculate expression ratios of genes between two different tissues
• use a common statistical test to identify differentially expressed genes
• flag low intensity data (most probably background noise)
• cluster a differentially expressed subset of all genes to identify those with similar expression profiles
• try to find what functions specific groups of genes (with similar expression profiles) have in common

You'll be using R and Bioconductor (a set of packages that run in R) to do most of the mathematical analyses. R is a free, very powerful statistics environment but it requires commands to perform every step of an analysis pipeline. These commands can be pasted into the program. Type '?myCommand' to get a help page about the command 'myCommand'.

#### Preliminary information: Image analysis and calculation of expression value

1. As described in Su et al., 2002, human tissue samples were hybridized on Affymetrix (one-color) arrays and chips were scanned. For each tissue, at least two independent samples were hybridized to separate chips.
2. Scanned images were quantified (including measurement of background) using standard software.

#### Part 0. Preprocessing and normalization of Affymetrix expression data

1. If you wish, you can skip these steps and download summarized (probeset-level) data.
2. Note that these analysis protocols are generally specific to Affymetric chips.
3. Data from a probeset (a series of oligos designed to a specific gene target) needs to be summarized by calculating an expression values for that probeset. This can be done using a Bioconductor/R version of the methods in the Microarray Suite 5.0 (MAS5, the standard Affymetrix algorithm) used by the Affymetrix software suite.
4. See the manuals from Affymetrix for more information about these processes, and the Statistical Algorithms Description Document for the actual equations used.
5. You also have the option of performing other algorithms such as RMA, GCRMA, and dChip.
6. Download the starting data ("CEL" files) at the top of the page and unzip this file to get 8 files ending in ".CEL".
7. Start R on your laptop by clicking on the "R" icon. You may also start it on tak. with the command /nfs/BaRC/R/R, but this doesn't give you any graphical interface, so learning R on a desktop computer is recommended.
8. If you're using your own computer, install Bioconductor (if you haven't done so before) by starting R and then pasting or typing these commands. The installation will take a few minutes.
```source("http://bioconductor.org/biocLite.R")
biocLite()```
9. Use the pull-down menu (File >> Change dir [Windows] or Misc >> Change working directory [Mac]) to go to the directory where you put the raw data (CEL files).
10. Load the "library" that contains the Affymetrix microarray code we'll want to use with the command
`library(affy)`
11. Read the CEL files (first command below) and then summarize and normalize with MAS5 (second command below). This could take a few minutes.
```affy.data = ReadAffy()
eset.mas5 = mas5(affy.data)```
12. The variable 'eset.mas5' contains normalized expression values for all probesets, along with other information. Let's continue by getting the expression matrix (probesets/genes in rows, chips in columns).
```exprSet.nologs = exprs(eset.mas5)
# List the column (chip) names
colnames(exprSet.nologs)
# Rename the column names if we want
colnames(exprSet.nologs) = c("brain.1", "brain.2",
"fetal.brain.1", "fetal.brain.2",
"fetal.liver.1", "fetal.liver.2",
"liver.1", "liver.2")```
13. At this time let's log-transform the expression values to get a more normal distribution. We have to remember we've done this when we calculate ratios. Logarithms can use any base, but base 2 is easiest when transforming ratios, since transformed 2-fold ratios up or down will be +1 or -1. As a result, we'll do all logs with base 2 to keep thing simplest.
14. `exprSet = log(exprSet.nologs, 2)`
15. Optional. In place of step 10 above, we could choose many other ways to preprocess the data, such as RMA (which outputs log2-transformed expression values)
`eset.rma = justRMA()`
or GCRMA (which also outputs log2-transformed expression values)
```library(gcrma)
eset.gcrma = justGCRMA()```
or dChip (also known as MBEI; not log-transformed)
```eset.dChip = expresso(affy.data, normalize.method="invariantset",
bg.correct=FALSE, pmcorrect.method="pmonly",summary.method="liwong")```
16. To print out our expression matrix (as with most data), we can use a command like
```write.table(exprSet, file="Su_mas5_matrix.txt", quote=F, sep="\t")
```
to get a tab-delimited file that we could view in Excel or a text editor.
17. While we're doing Affymetrix-specific preprocessing, let's calculate an Absent/Present call for each probeset.
```# Run the Affy A/P call algorithm on the CEL files we processed above
data.mas5calls = mas5calls(affy.data)
# Get the actual A/P calls
data.mas5calls.calls = exprs(data.mas5calls)
# Print the calls as a matrix
write.table(data.mas5calls.calls, file="Su_mas5calls.txt", quote=F, sep="\t")```

#### Part I. Normalization of expression data [Optional]

1. Why normalize? Chips may have been hybridized to different amounts of RNA, for different amounts of time, with different batches of solutions, etc. Normalization should remove systematic biases and make any comparisons between chips more meaningful.
2. If you performed Part 0 above, your data is already normalized. But if you got an unnormalized expression from another source, how could you normalize it?
4. Read an expression matrix in the form of a tab-delimited text file that you created or downloaded above. The first line contains column headers, and the first row (without a header field) contains gene identifiers.
`exprSetRaw = read.delim("Su_raw_matrix.txt")`
5. Calculate the trimmed mean of all expression values on each chip.
• A trimmed mean calculates a summary value that is somewhere between the mean and median of the set of values.
• We remove the top and bottom 2% of values (for example), and find the mean of the remaining values,
• To get the trimmed mean of column 1 on our matrix, we can use a command like
`trmean.col.1 = mean(exprSetRaw[,1], trim=0.02)`
• We can apply this command to all columns at once with the "apply" command, where the '2' means to run the command on all columns (and '1' would be to do the same thing on all rows).
```trmean = apply(exprSetRaw, 2, mean, trim=0.02)
trmean```
• If we're curious about the variation between genes on each chip, we can find the standard deviation for each chip
```sd = apply(exprSetRaw, 2, sd)
sd```
• Or we could compare medians
```median = apply(exprSetRaw, 2, median)
median```
6. Divide all expression values by the mean for that chip, and multiply by a scaling factor, such as the mean of the trimmed means.
```mean.of.trmeans = mean(trmean)
exprSet.trmean = exprSetRaw / trmean * mean.of.trmeans
```
7. Print the data that's been normalized by the global trimmed mean
```write.table(exprSet.trmean, file="Su_mas5_trmean_norm.txt", quote=F, sep="\t")
```
8. For further processing (Part II), rename your data
`exprSet = exprSet.trmean`
9. [Optional] Quantile normalization is often too extreme, but it's common for Affymetrix probe-level normalization (being part of MAS5). If you wish to use it, one way is to use a command from the "limma" package
```library(limma)
exprSet.quantile = normalizeQuantiles(exprSet)```
10. Another common normalization choice (generally for 2-color arrays) is loess. We combined the "brain.1" and "fetal.brain.1" expression data into a file that we can pretend is 2-color data (together with 0 values for "background"). Then we can normalize it with loess.
• Read this fake 2-color file, indicating the columns for the Red and Green channels and the Red and Green backgrounds.
```brain.fetalbrain.2color = read.maimages("brain.fetalbrain.2color.data.txt",
columns=list(G="brain.1", R="fetal.brain.1", Gb="bg1", Rb="bg2"))```
• Now we have the two channels of data which we can loess normalize
```brain.fetalbrain.2color.loess =
normalizeWithinArrays(brain.fetalbrain.2color, method="loess")```
• How do you know if anything changed? Print an MA plot before and after normalization
```# Set up a page with two figures next to each other
par(mfrow=c(1,2))
# Print the figures
plotMA(brain.fetalbrain.2color)
plotMA(brain.fetalbrain.2color.loess)```
• How do the figures differ?

#### Part II. Calculating log2 ratios

1. Calculate the mean of each pair of replicated experiments (converting 8 chips to 4 means).
2. ```brain.mean = apply(exprSet[, c("brain.1", "brain.2")], 1, mean)
fetal.brain.mean = apply(exprSet[, c("fetal.brain.1", "fetal.brain.2")], 1, mean)
liver.mean = apply(exprSet[, c("liver.1", "liver.2")], 1, mean)
fetal.liver.mean = apply(exprSet[, c("fetal.liver.1", "fetal.liver.2")], 1, mean)```
3. Calculate the ratios (for both brain and liver) of fetal tissue / adult tissue (converting 4 experiments to 2 ratios). Since we've already log-transformed the expression values, we need to use the fact that log(A / B) = log(A) - log(B)
```brain.fetal.to.adult = fetal.brain.mean - brain.mean
```all.data = cbind(exprSet, brain.mean, fetal.brain.mean, liver.mean, fetal.liver.mean,
`write.table(all.data, file="Microarray_Analysis_data_1_SOLUTION.txt", quote=F, sep="\t")`