WIBR Bioinformatics and Research Computing

Statistics for Biologists (November 2007)

URL: http://iona.wi.mit.edu/bio/education/stats2007/

Exercises for lecture 3

Goals for class 3 exercises: Using Excel and R, analyze the intersection of 2 sets with the Chi-squared test and/or Fisher’s exact test, determine over-representation with the hypergeometric distribution, analyze survival data with Kaplan Meier curves and the log-rank test, and perform regression and correlation analysis.

Links to data for download are in double square brackets [[ like this ]]. Links to other pages are in double curly brackets {{ like this }}.



Perform selected biostatistical applications

TaskUsing ExcelUsing R
1: You perform expression analysis on two mouse models, each with a different gene involved in a favorite pathway. Mouse KO1 has 200 downregulated genes, and mouse KO2 has 160 downregulated genes. The intersection of these gene lists is 30 genes, and your microarray tested 10,000 genes. Is the intersection of these gene lists larger than you'd expect by chance? For a test of independence, one may use Fisher's test or the chi-squared test. Start by creating a contigency table describing the data, which should look like this:
30130
1709670

Excel has a function

=CHITEST
but it requires that you have expected, as well as observed proportions for the test.

Using our data, if the KO1 gene list was 200/10000 = 0.02 of all genes, and the KO2 gene list was 160/10000 = 0.016 of all genes, then the intersection would be expected to be 0.02 x 0.016 = 0.00032 or 3.2 genes.

Modifying the matrix to require this value, it would become
3.2156.8
196.89643.2
(which maintains the size of each gene list).

Then we can use the CHITEST formula to compare this matrix to our observed matrix, with a command like

=CHITEST(A1:B2,C1:D2)
where the first set of cells points to the actual data and the second set points to the expected data.
Start by creating a contigency table on paper describing the data, like
30130
1709670
The transposed version of this matrix is equally correct.

Then create this in R with a command like

ko = matrix(c(30, 170, 130, 9670), ncol=2)
If you want, you can even label the columns and rows:
rownames(ko) = c("KO2_yes", "KO2_no")
colnames(ko) = c("KO1_yes", "KO1_no")
Check the table:
ko
To test for independence between KO1 and KO2 genes (rows and columns in the matrix) using Fisher's exact test, which calculates an exact p-value, use the command
fisher.test(ko)
To get an approximate p-value, try
chisq.test(ko)
To get the expected intersection size, try
chisq.test(ko)$expected
2: We're interested in mesoderm development in our ES cells. The ES cells come from the genome of B. studious, which has 5000 annotated genes, 100 of them are involved in this process. We add a drug to these cells and find that 200 genes are downregulated, and of these, 15 are annotated as involved in mesoderm development. On the basis of these data, does this drug specifically influence mesoderm development? In Excel, the command for each separate probability is
=HYPGEOMDIST(h, n, H, N)
and then these could be added together but Excel appears to have trouble with this function.
To test for enrichment in association with mesoderm development in our gene set, use the hypergeometric distribution.

Use commands like

h = 15
H = 100
n = 200
N = 5000
pvals = 1 - cumsum(dhyper(0:h, H, N-H, n) )
where
  • h = number of hits in the gene set (the sample)
  • H = number of hits in the genome (the population)
  • n = number of objects (genes) in the gene set
  • N = number of genes in the genome

    This calculates the probabilities of having 0 to h genes classified as mesoderm development, adds them up, and then subtracts the value from 1 to get the probability of more than h genes with this classification.

    Then get the final sum in the list, which shows the p-value reflecting confidence that there is enrichment:

    pvals[length(pvals)]
    If this p-value is less than our threshold (ex: 0.05), then we can confidently state that the downregulated gene set is enriched for genes involved in mesoderm devlopment.
  • 3: Given [[ survival data ]], use these to calculate the survival function according to the method of Kaplan Meier and draw a figure. The data describe 48 rats that were injected with a carcinogen, and then given either a drug or placebo. Does the drug influence the survival of the rats? Too much for Excel. Load the package:
    library(survival)
    Either download the [[ survival data ]] and read it with
    rats = read.delim("rats.txt", header=T)
    or try to copy it from your browser into R directly with
    rats = read.delim(file("clipboard"), header=T)
    or try reading it from the web (but you may need to make the following web address all one line) with
    rats = read.delim("http://iona.wi.mit.edu/
    bio/education/stats2007/exercises/data/rats.txt",
    header=T)
    Look at the data to get the headings. Compute a survival curve for all the data and plot it:
    my.surv.all = survfit(Surv(days, dead), data=rats)
    plot(my.surv.all)
    Does the drug influence survival? Describe the survival function under the influence of the drug ('~' means "described by")?
    my.surv.by.group =
    survfit(Surv(days, dead) ~ drug, data=rats)
    plot(my.surv.by.group)
    Does giving the drug change the survival function?
    survdiff(Surv(days, dead) ~ drug, data=rats)
    4: Given [[ a set of data ]] showing a potential effect of blood glucose level on the rate of ventricular (heart) contraction, determine if this effect is observable. If so, create a model of contraction rate as a function of glucose. Finally, how much of the variation in heart contraction can be explained by the effect of glucose concentration. Open the file heart.txt in Excel or paste the data from heart.html into Excel.

    Start with a scatterplot of contraction velocity vs. time. Using the Chart Wizard icon, select XY(Scatter) >> Next, select the two columns of data and click on "Finish". What type of relatioship between variables appears to be present.

    To perform a regression, go to Tools >> Data Analysis, and then select Regression. To select input data, note that the input Y range ("velocity" in our data) comes first, followed by the input X range. Feel free to check every box except "Constant is Zero" unless you have reason to believe that the y-intercept should be 0.

    What does the output mean? What is the slope and y-intercept of the regression line? Can you be confident that the slope is not 0? How much of the variation in velocity can be explained by variation in glucose?

    Read the data in R and plot the data:
    plot(heart$glucose, heart$velocity)
    Can you redo the plot with a title and labelled axes? Try
    ?plot
    to find the way to do this.

    Perform the linear regression analysis, using a model that attempts to describe the dependence of velocity on glucose:

    lm.v.g = lm(velocity ~ glucose, data=heart)
    summary(lm.v.g)
    
    Then draw the points and the regression line together:
    plot(heart$glucose, heart$velocity, pch=19, 
    cex=2, col="orange", main="Regression line: velocity ~ glucose")
    lines(heart$glucose, fitted(lm.v.g), lwd=2, col="blue")
    
    In R you can label points as you click on them with "identify". For example, while the previous plot is open, type
    identify(heart$glucose, heart$velocity, 
    paste(heart$glucose, ",", heart$velocity, sep=""))
    Click on a point on the graph and see what happens. If you're using R on a Mac, you may need to go back to the main R window and hit enter.

    When you're finished, right click on the figure (Unix R), click on "Stop" (Windows R), or just close the figure.

    5: Calculate the correlation between [[ gene expression profiles ]] to identify genes with the closest profile to the gene you've selected. These data show rows of genes and columns of cell types. Open the file expression_all.txt in Excel or paste the data from expression_all.html into Excel.

    If you started with "heart.txt", you may have to shift the header to the right by one cell.

    Select a gene from the list and place it at the second row of the matrix (just under the header). Calculate the correlation of this gene's profile with all the other profiles to the right of the matrix, starting with the next row. Use a command like

    =CORREL($B$2:$N$2,$B3:$N3)
    Copy that cell and paste it down the column. To preserve the values after sorting, select the column with the correlation data, copy it, and over top, do Paste Special >> Values

    Sort the correlation column in ascending order. What genes are most positively correlated with yours? Are any negatively correlated with yours? What could explain that?

    To measure correlation, use a command like
    cor(x, y)
    To set this up, edit "expression_all.txt" as you did for Excel: place your selected gene in the first row.

    Since we'd like to calculation the correlation between your gene and every other gene in the matrix, we could do something like this:

    exp = read.delim("expression_all.txt", h=T)
    nrow = nrow(exp)	# count the rows in your data
    ncol = ncol(exp)	# count the columns in your data
    
    # Create an empty column to hold the correlation data
    corrs = matrix(data=NA, nrow=nrow, ncol=1)
    # Label the rows of the correlation data
    #	with the gene names
    rownames(corrs) = rownames(exp)
    
    # Get the list of expression values for your gene
    my.gene = as.numeric(exp[1,])
    
    # Go through rows 1 to nrow (the last row)
    for (i in 1:nrow)
    {
    	# Get the list of expression values for this gene
    	other.gene = as.numeric(exp[i,1:ncol])
    	# Calculate the correlation between expression
    	#     profiles and round off answer
    	corrs[i,1] = round(cor(other.gene, my.gene), 4)
    }
    # Get the row order by descreasing correlation
    order.by.corr = order(corrs[,1], decreasing=T)
    # Sort the genes by decreasing correlation
    corrs.sorted = corrs[order.by.corr,]
    # Sort the expression data by decreasing correlation
    exp.sorted = exp[order.by.corr ,]

    If you're interested, draw a figure with your selected gene and the 4 most highest correlated genes

    # Format the image, making it a 5 x 1 layout
    par(mfrow= c(5, 1), omi=c(0.7, 0.2, 0.2, 0.05), 
    mai=c(0.05, 0.05, 0.2, 0.05))
    for (i in 1:4) # Loop through the first 5 genes
    {
          # Draw a barplot for each gene
          barplot(as.numeric(exp.sorted[i,]), col=rainbow(ncol(exp)),
          main=rownames(exp.sorted)[i], )
    }
    i=5	# Do the last gene with tissue labels
    barplot(as.numeric(exp.sorted[i,]), col=rainbow(ncol(exp)),
          main=rownames(exp.sorted)[i], names.arg=colnames(exp), 
    	  las=2, cex.axis=0.5)
    
    6: Exit from applications You may wish to save your file and/or email it to yourself for reference. To get all of your work (input and text output) look for the "Save to file" choice on the menu.

    You can save the "history" output (during or at the end of your work) with a command like

    savehistory(file = "R_history_class3.R")

    To exit from R, use the pull-down menu or type

    q()
    for "quit".
    Dataset 1 for exercises above (truncated after 20 lines): survival data for rats injected with a carcinogen and then treated either with a drug or placebo (tab-delimited text)
    litter	days	dead	drug
    1	101	0	no
    1	49	1	yes
    1	104	0	yes
    2	104	0	no
    2	102	0	yes
    2	104	0	yes
    3	104	0	no
    3	104	0	yes
    3	104	0	yes
    4	77	0	no
    4	97	0	yes
    4	79	0	yes
    5	89	0	no
    5	104	0	yes
    5	104	0	yes
    6	88	1	no
    6	96	1	yes
    6	104	0	yes
    7	104	1	no
    7	94	0	yes
    Dataset 2 for exercises above: blood glucose and ventricular shortening velocity for type 1 diabetic patients. (tab-delimited text)
    glucose	velocity
    15.3	1.76
    10.8	1.34
    8.1	1.27
    19.5	1.47
    7.2	1.27
    5.3	1.49
    9.3	1.31
    11.1	1.09
    7.5	1.18
    12.2	1.22
    6.7	1.25
    5.2	1.19
    19	1.95
    15.1	1.28
    6.7	1.52
    4.2	1.12
    10.3	1.37
    12.5	1.19
    16.1	1.05
    13.3	1.32
    4.9	1.03
    8.8	1.12
    9.5	1.7
    Dataset 3 for exercises above (truncated after 10 rows): expression data for 10 genes across tissue types (tab-delimited text)
    Heart	Kidney	Liver	Lung	Ovary	Pancr	Prost	S_cord	Testis	Thala	Thy	Blood	Brain
    AANAT	526	715	877	273	263	951	367	284	311	446	191	488	400
    ABCB10	41	42	16	48	75	640	61	28	23	47	40	91	25
    ABCC5	121	188	106	133	184	276	129	225	273	280	62	129	367
    ABI-2	1	35	1	41	93	1	95	134	88	128	122	1	124
    ACADM	169	1146	55	307	497	388	493	434	661	487	537	130	275
    ACADVL	874	1134	1114	348	593	2939	553	518	424	508	1004	275	342
    ACATN	115	441	153	188	151	473	199	167	165	80	271	271	62
    ACINUS	478	428	396	464	495	842	462	520	573	471	562	839	560
    ACR	317	464	492	347	213	971	327	193	1390	143	148	426	212
    ACRV1	1	1	1	1	1	61	1	1	523	1	1	1	1

    Questions or comments?   wibr-bioinformatics@wi.mit.edu
    Bioinformatics and Research Computing at Whitehead Institute