Goals for class 3 exercises: Using Excel and R,
analyze the intersection of 2 sets with the Chi-squared test and/or Fisher’s exact test,
determine over-representation with the hypergeometric distribution,
analyze survival data with Kaplan Meier curves and the log-rank test, and
perform regression and correlation analysis.
Links to data for download are in double square brackets [[ like this ]]. Links to other pages are in double curly brackets {{ like this }}. |
Task | Using Excel | Using R | ||||||||||||
1: You perform expression analysis on two mouse models, each with a different gene involved in a favorite pathway. Mouse KO1 has 200 downregulated genes, and mouse KO2 has 160 downregulated genes. The intersection of these gene lists is 30 genes, and your microarray tested 10,000 genes. Is the intersection of these gene lists larger than you'd expect by chance? For a test of independence, one may use Fisher's test or the chi-squared test. | Start by creating a contigency table describing the data, which should look like this:
Excel has a function =CHITESTbut it requires that you have expected, as well as observed proportions for the test. Using our data, if the KO1 gene list was 200/10000 = 0.02 of all genes, and the KO2 gene list was 160/10000 = 0.016 of all genes, then the intersection would be expected to be 0.02 x 0.016 = 0.00032 or 3.2 genes. Modifying the matrix to require this value, it would become
Then we can use the CHITEST formula to compare this matrix to our observed matrix, with a command like =CHITEST(A1:B2,C1:D2)where the first set of cells points to the actual data and the second set points to the expected data. |
Start by creating a contigency table on paper describing the data, like
Then create this in R with a command like ko = matrix(c(30, 170, 130, 9670), ncol=2)If you want, you can even label the columns and rows: rownames(ko) = c("KO2_yes", "KO2_no") colnames(ko) = c("KO1_yes", "KO1_no")Check the table: koTo test for independence between KO1 and KO2 genes (rows and columns in the matrix) using Fisher's exact test, which calculates an exact p-value, use the command fisher.test(ko)To get an approximate p-value, try chisq.test(ko)To get the expected intersection size, try chisq.test(ko)$expected |
||||||||||||
2: We're interested in mesoderm development in our ES cells. The ES cells come from the genome of B. studious, which has 5000 annotated genes, 100 of them are involved in this process. We add a drug to these cells and find that 200 genes are downregulated, and of these, 15 are annotated as involved in mesoderm development. On the basis of these data, does this drug specifically influence mesoderm development? | In Excel, the command for each separate probability is
=HYPGEOMDIST(h, n, H, N)and then these could be added together but Excel appears to have trouble with this function. |
To test for enrichment in association with mesoderm development in our gene set, use
the hypergeometric distribution.
Use commands like h = 15 H = 100 n = 200 N = 5000 pvals = 1 - cumsum(dhyper(0:h, H, N-H, n) )where This calculates the probabilities of having 0 to h genes classified as mesoderm development, adds them up, and then subtracts the value from 1 to get the probability of more than h genes with this classification. Then get the final sum in the list, which shows the p-value reflecting confidence that there is enrichment: pvals[length(pvals)]If this p-value is less than our threshold (ex: 0.05), then we can confidently state that the downregulated gene set is enriched for genes involved in mesoderm devlopment. |
||||||||||||
3: Given [[ survival data ]], use these to calculate the survival function according to the method of Kaplan Meier and draw a figure. The data describe 48 rats that were injected with a carcinogen, and then given either a drug or placebo. Does the drug influence the survival of the rats? | Too much for Excel. |
Load the package:
library(survival)Either download the [[ survival data ]] and read it with rats = read.delim("rats.txt", header=T)or try to copy it from your browser into R directly with rats = read.delim(file("clipboard"), header=T)or try reading it from the web (but you may need to make the following web address all one line) with rats = read.delim("http://iona.wi.mit.edu/ bio/education/stats2007/exercises/data/rats.txt", header=T)Look at the data to get the headings. Compute a survival curve for all the data and plot it: my.surv.all = survfit(Surv(days, dead), data=rats) plot(my.surv.all)Does the drug influence survival? Describe the survival function under the influence of the drug ('~' means "described by")? my.surv.by.group = survfit(Surv(days, dead) ~ drug, data=rats) plot(my.surv.by.group)Does giving the drug change the survival function? survdiff(Surv(days, dead) ~ drug, data=rats) |
||||||||||||
4: Given [[ a set of data ]] showing a potential effect of blood glucose level on the rate of ventricular (heart) contraction, determine if this effect is observable. If so, create a model of contraction rate as a function of glucose. Finally, how much of the variation in heart contraction can be explained by the effect of glucose concentration. |
Open the file heart.txt in Excel or paste the data from heart.html into Excel.
Start with a scatterplot of contraction velocity vs. time. Using the Chart Wizard icon, select XY(Scatter) >> Next, select the two columns of data and click on "Finish". What type of relatioship between variables appears to be present. To perform a regression, go to Tools >> Data Analysis, and then select Regression. To select input data, note that the input Y range ("velocity" in our data) comes first, followed by the input X range. Feel free to check every box except "Constant is Zero" unless you have reason to believe that the y-intercept should be 0. What does the output mean? What is the slope and y-intercept of the regression line? Can you be confident that the slope is not 0? How much of the variation in velocity can be explained by variation in glucose? |
Read the data in R and plot the data:
plot(heart$glucose, heart$velocity)Can you redo the plot with a title and labelled axes? Try ?plotto find the way to do this. Perform the linear regression analysis, using a model that attempts to describe the dependence of velocity on glucose: lm.v.g = lm(velocity ~ glucose, data=heart) summary(lm.v.g)Then draw the points and the regression line together: plot(heart$glucose, heart$velocity, pch=19, cex=2, col="orange", main="Regression line: velocity ~ glucose") lines(heart$glucose, fitted(lm.v.g), lwd=2, col="blue")In R you can label points as you click on them with "identify". For example, while the previous plot is open, type identify(heart$glucose, heart$velocity, paste(heart$glucose, ",", heart$velocity, sep=""))Click on a point on the graph and see what happens. If you're using R on a Mac, you may need to go back to the main R window and hit enter. When you're finished, right click on the figure (Unix R), click on "Stop" (Windows R), or just close the figure. |
||||||||||||
5: Calculate the correlation between [[ gene expression profiles ]] to identify genes with the closest profile to the gene you've selected. These data show rows of genes and columns of cell types. |
Open the file expression_all.txt in Excel or paste the data from expression_all.html into Excel.
If you started with "heart.txt", you may have to shift the header to the right by one cell. Select a gene from the list and place it at the second row of the matrix (just under the header). Calculate the correlation of this gene's profile with all the other profiles to the right of the matrix, starting with the next row. Use a command like =CORREL($B$2:$N$2,$B3:$N3)Copy that cell and paste it down the column. To preserve the values after sorting, select the column with the correlation data, copy it, and over top, do Paste Special >> Values Sort the correlation column in ascending order. What genes are most positively correlated with yours? Are any negatively correlated with yours? What could explain that? |
To measure correlation, use a command like
cor(x, y)To set this up, edit "expression_all.txt" as you did for Excel: place your selected gene in the first row. Since we'd like to calculation the correlation between your gene
and every other gene in the matrix, we could do something like this:
If you're interested, draw a figure with your selected gene and the 4 most highest correlated genes # Format the image, making it a 5 x 1 layout par(mfrow= c(5, 1), omi=c(0.7, 0.2, 0.2, 0.05), mai=c(0.05, 0.05, 0.2, 0.05)) for (i in 1:4) # Loop through the first 5 genes { # Draw a barplot for each gene barplot(as.numeric(exp.sorted[i,]), col=rainbow(ncol(exp)), main=rownames(exp.sorted)[i], ) } i=5 # Do the last gene with tissue labels barplot(as.numeric(exp.sorted[i,]), col=rainbow(ncol(exp)), main=rownames(exp.sorted)[i], names.arg=colnames(exp), las=2, cex.axis=0.5) |
||||||||||||
6: Exit from applications | You may wish to save your file and/or email it to yourself for reference. |
To get all of your work (input and text output) look for the "Save to file" choice on the menu.
You can save the "history" output (during or at the end of your work) with a command like savehistory(file = "R_history_class3.R") To exit from R, use the pull-down menu or type q()for "quit". |
Dataset 1 for exercises above (truncated after 20 lines): survival data for rats injected with a carcinogen and then treated either with a drug or placebo (tab-delimited text) |
|
Dataset 2 for exercises above: blood glucose and ventricular shortening velocity for type 1 diabetic patients. (tab-delimited text) |
|
Dataset 3 for exercises above (truncated after 10 rows): expression data for 10 genes across tissue types (tab-delimited text) |
|