WIBR Bioinformatics and Research Computing

Statistics for Biologists (November 2007)

URL: http://iona.wi.mit.edu/bio/education/stats2007/

Exercises for lecture 2

Goals for class 2 exercises: (1) Explore the R Commander, a clever graphical interface. (2) Using Excel and R, perform some statistical tests for pairwise comparisons, do some power calculations, and perform two types of multiple hypothesis testing.

Links to data for download are in double square brackets [[ like this ]]. Links to other pages are in double curly brackets {{ like this }}.


Start R on your desktop or laptop by clicking on the icon. To use R Commander (or R in general), you can also {{ connect to R on tak }}

Explore the R Commander

TaskUsing R Commander
1: Start R on your laptop or {{ R on tak }}, load Rcmdr and then load and view some data.

Rcmdr can be installed on desktop operating systems but it's not very straightforward (unlike most R packages). As a result, we show you also how to connect to R on tak (our big Linux computer).

If you do want to try to install it on your own desktop computer, start R and use the commmand

install.packages("Rcmdr", dependencies=TRUE)
Start R. If you're on a Mac, also start X11 but then go back to R.

Load the package by typing

library(Rcmdr)
If you get an error, Rcmdr may not be installed.

Then read a file:

Data >> Data in packages >> Read data set from an attached package

Double click on "datasets". Double click on "ChickWeight". To get some information about this "Active data set",

Data >> Active data set >> Help on active data set

Then click on "View data set" to get a spreadsheet view.
2: Draw a boxplot. The chicks are in multiple groups, so it's somewhat tricky to view the data most appropriately. For a look at the distribution of the chick weights with a boxplot (divided into groups), try

Graphs >>
Boxplot >>
Plot by groups >>
click on "Diet" >>
OK >>
Select "weight" >>
OK

3: Draw a scatterplot. For a look at all of their weight data with a scatterplot, try

Graphs >>
Scatterplot >>
Plot by groups >>
click on "Diet" >>
Check "Plot lines by group" >>
OK >>
for x-variable select "Time" >>
for y-variable select "weight" >>
OK

4: Explore the R Commander. Note that all executed commands are printed in the Script Window. If commands (like Statistics >> Summaries >> Active data set) generate text output, it appears in the Output Window. Also notice that you can type a command (like ?boxplot) in the Script Window and click on "Submit" (on the right in the middle) to execute this last command.

Perform pairwise comparisons, power calculations, and multiple hypothesis testing

TaskUsing ExcelUsing R
1: Select (right click) and download data about [[ Energy expenditure by lean and obese women ]].

If that isn't available, you can copy the data from the bottom of page, paste into a text editor, and save the file as plain text.

Open the tab-delimited text file energy.txt you should have downloaded (from the "Task" column at left) or paste the data from [[ energy.html ]] into Excel. After starting R, move R to the directory you want by using the pull-down menu:

For Windows: File >> Change dir >> Browse >> [Browse to Desktop] >> OK

For Mac: Misc >> Change dir >> Browse >> [Browse to Desktop] >> OK

If you don't want to use the pull-down menu, you can move the data file into working directory ("wd"). To find the working directory, use the command
getwd()

Then you can change (set) the working directory to match the location of data with a command like

setwd("C:/Documents and Settings/student/Desktop")
or
setwd("/Users/student/Desktop")
assuming "student" is your laptop username.

Then read the file by typing a command like

energy = read.delim("energy.txt", header=T)
(and hitting "enter").

To see the "energy" dataset, either type

energy
or
edit(energy)
With the "edit" view, if fields are changed, energy actually stays the same. To change the dataset (not now!) use a command like "fix(energy)".
2: Are the means of these data sets different? Use a t-test to help you answer the question. Since we're testing for inequality, we'll need to use the two-tailed version. Also, unless we know otherwise, we may not want to assume the variances for each data set are equal. In that case we should use the modification of the t test (Welch's) that doesn't assume equal variances. Use a formula like
=TTEST(A2:A14,B2:B10,2,3)
where the first two fields are the data sets we're comparing. (Look at TTEST in Excel help to get the syntax and an explanation.) The third field, "tails", (2) indicates the two-tailed test, and the fourth field (3) indicates that we're performing a "Two-sample unequal variance" test.

Another choice is go to

Tools >> Data Analysis

After selecting the desired type of t-test (near the bottom of the list), input fields should be easy to figure out. The output for this version of the t-test has a lot more information than just the p-value.

Use a command like
t.test(energy$lean, energy$obese)
To get the syntax, options, and defaults, type
?t.test
The help page for t.test shows that the default command uses a two-sided (two-tailed) test for samples with unequal variances. Note that the p-value should agree with the Excel value, but also look at the other output data provided: Other important information is the means of the two samples and the 95% confidence interval (CI) for the difference of the means. (If the CI includes 0, then the p-value will be greater than 0.05). If you wanted to access only some of this data, type something like
lean.vs.obese = t.test(energy$lean, energy$obese)
names(lean.vs.obese)
and then you'll get the available fields. To just get the p-value, for example, type
lean.vs.obese$p.value
3: For a t-test or another reason, how can you find out if the variances of two samples are equal? To help answer the question, use the F test, which tests if the ratio of the variances is 1. Use a formula like
=FTEST(A2:A12,B2:B12)
For the energy data, you should get a high p-value that doesn't show that the variances are different.
Use a command like
var.test(energy$lean, energy$obese)
which performs the F test.
4: How can you find out how many replicates you need to get a desired statistical power for a test? Or given the number of replicates, how sure can you be about detecting small differences between means? You can enter the formulas for power into Excel, but there's no quick way to do the calculation. For power calculations for t-test, the command is
power.t.test
You enter all the required data (n (sample size), difference in means (delta), standard deviation, significance level (alpha), power) -- except one -- and the missing piece of data is calculated. Using the energy data, the standard deviation is about 1.3. Assuming your p-value cutoff is 0.05, the desired power is 0.9, and you have 9 replicates for each sample, how small a difference between means could you detect?
power.t.test(n=9, delta=NULL, sd=1.3,
sig.level=0.05, power=0.9)
On the other hand, if you want to detect a difference of 1.5, how many replicates should you do?
power.t.test(n=NULL, delta=1.5, sd=1.3,
sig.level=0.05, power=0.9)
Doing a series of these calculations (and knowing the expense/time of additional replicates) may help you more effectively design an experiment.
5: How can you compare means of two samples using a nonparametric test? This could be recommended if the distributions of the data are far from normal -- one of the requirements of a t-test. Like all nonparametric tests, the Wilcoxon signed rank test compares ranks rather than actual values. The Wilcoxon signed rank test (Mann Whitney test) is not available in Excel. Use a command like
wilcox.test(energy$lean, energy$obese)
You may get a warning if two values are identical (like 7.48 in the energy data), since it can be tricky to calculate with tied ranks. The warning indicates that the p-value generated by the test is approximate.
6: Select (right click) and download data about [[ energy intake ]] for women pre- and postmenstrually the same data at the bottom of the page.

Are the means of these data sets different? These 11 sets of data were collected on the same 11 women, so this is a clear case when the paired t-test should be used.

Open the file intake_paired.txt in Excel or paste the data from intake_paired.html into Excel.

Use a formula like

=TTEST(A2:A12,B2:B12,2,1)
The fourth field (1) indicates that we're performing a paired test. If you were to try this as the usual unpaired t-test (with "3" for the fourth field), the p-value is very different (and wrong for these data).
Read the table with a command like
intake = read.delim("intake_paired.txt", header=T)
Another way to try is open the file in Excel, select and copy the energy intake data, and then use the command
intake = read.delim(file("clipboard"), header=T) 
You might want to check that you got it right with
edit(intake)
To perform the test, use a command like
t.test(intake$pre, intake$post, paired=TRUE)
7: Correct for multiple hypothesis testing by FWER (the Bonferroni correction).

Start by downloading [[ expression data ]], showing expression levels for ten genes in 6 replicates each of tissues "A" and "B" (or get it below)

Perform a t-test comparing levels across tissues, and then correct the p-values.

Open the file expression.txt in Excel or paste the data from expression.html into Excel.

Create a new column on the right called "p-value", and get p-values from a t-test using a command like

=TTEST(B2:G2,H2:M2,2,3)
Then copy the formula down the column.

Count the number of genes tested in this experment (rows in the table). Multiply each p-value by this number, but don't go higher than 1. This can be done with a formula like

=MIN(N2*10,1)
where cell N2 contains the raw p-value. Copy this formula down the column.
Read the table with a command like
exp = read.delim("expression.txt", h=T)
One way to do t-tests for each row and then correct the p-values is like this:
# Get the number of rows (genes)
num.genes = nrow(exp)
# Make a matrix of "num.genes" rows and 2 columns,
# filled with 0's
p.vals = matrix(0, num.genes, 2)
# Name the columns
colnames(p.vals) = c("raw.p.value",
	"Bonferroni.p.values")
# Count from 1 to "num.genes"
for (i in 1:num.genes)
{
	# Calculate the p-value from the t-test
	p.vals[i,1] = t.test(exp[i,2:7],
		exp[i,8:13])$p.value
	# Multiply p-value by "num.genes"
	p.vals[i,2] = min (p.vals[i,1] 
		* num.genes, 1)
}
# Add new columns to expression table
exp.p = cbind(exp, p.vals)
Okay, so some things are easier to do in Excel.

To see the output, remember that you can use a command like

edit(exp.p)
or print the output to a file:
write.table(exp.p, file = "pvals_Bonferroni.txt",
sep="\t", quote=FALSE)
8: Correct for multiple hypothesis testing by FDR (the False Discovery Rate)

This can now be done with a {{ web tool }} too.

Can be done in Excel, but R is actually easier.

If you really want to, sort p-values in increasing order. The last p-value in the list stays the same. The p-value one row above is calculated with a formula like

=MIN(N10*(10/RANK(N10,$N$2:
$N$11,1)),O11)
where N10 contains the current raw p-value, N2 through N11 contain all the p-values, and O11 is the previous (one row below) corrected p-value.

Copy the formula up the column.

Using the previous table you created, grab the column with the raw p-values:
raw.pvals = exp.p$raw.p.value

With one command, you can do a multiple hypothesis correction:

p.adjust(raw.pvals, method)
where the 'method' is one of six types: "fdr", "holm", "hochberg", "hommel", "bonferroni", "BH", and "BY". To get more information about these methods, get command documentation by typing
?p.adjust
We'll be using FDR (using the method of Benjamini and Hochberg), so type
FDR.p.values = p.adjust(raw.pvals, "fdr")
And now, merge these new columns with your previous table.
data.adjp = cbind(exp.p, FDR.p.values)
The FDR corrections are in the last column. Note how different the p-values are between the raw values, the Bonferroni-corrected values, and the FDR-corrected values.

And finally, print all the data to a file:

write.table(data.adjp, file = 
"pvals_corrected_all.txt", sep="\t", 
quote=FALSE)
9: Exit from applications You may wish to save your file and/or email it to yourself for reference. To get all of your work (input and text output) look for the "Save to file" choice on the menu.

You can save the "history" output (during or at the end of your work) with a command like

savehistory(file = "R_history_class2.R")

To exit from R, use the pull-down menu or type

q()
for "quit".
Dataset 1 for exercises above: energy expenditure in groups of lean and obese women (tab-delimited text)
lean	obese
7.53	9.21
7.48	11.51
8.08	12.79
8.09	11.85
10.15	9.97
8.4	8.79
10.88	9.69
6.13	9.68
7.9	9.19
7.05
7.48
7.58
8.11
Dataset 2 for exercises above: paired energy intake for women pre- and postmenstrually (tab-delimited text)
pre	post
5260	3910
5470	4220
5640	3885
6180	5160
6390	5645
6515	4680
6805	5265
7515	5975
7515	6790
8230	6900
8770	7335
Dataset 3 for exercises above: expression data for 10 genes across two tissue types (tab-delimited text)
Gene	A1	A2	A3	A4	A5	A6	B1	B2	B3	B4	B5	B6
Gene1	353	379	524	383	278	291	648	242	779	1307	567	490
Gene2	159	60	156	97	28	80	90	168	124	133	182	182
Gene3	1861	1603	2460	1486	1526	939	2259	1844	2345	1944	1225	2370
Gene4	171	80	397	75	44	113	286	464	525	460	258	429
Gene5	447	364	627	221	263	132	81	163	286	135	111	73
Gene6	3767	2810	2206	3723	4071	1379	624	680	1033	1687	1367	2194
Gene7	1915	3120	4072	2120	325	1210	1411	1007	2080	830	500	616
Gene8	20	59	213	20	35	286	20	20	195	406	366	1238
Gene9	835	935	1665	764	1323	1030	3237	1125	4655	3807	3593	2701
Gene10	516	702	423	388	569	60	269	339	188	94	171	139

Questions or comments?   wibr-bioinformatics@wi.mit.edu
Bioinformatics and Research Computing at Whitehead Institute