### Exercises for lecture 2

 Goals for class 2 exercises: (1) Explore the R Commander, a clever graphical interface. (2) Using Excel and R, perform some statistical tests for pairwise comparisons, do some power calculations, and perform two types of multiple hypothesis testing. Links to data for download are in double square brackets [[ like this ]]. Links to other pages are in double curly brackets {{ like this }}.

 Start R on your desktop or laptop by clicking on the icon. To use R Commander (or R in general), you can also {{ connect to R on tak }}

### Explore the R Commander

 Task Using R Commander 1: Start R on your laptop or {{ R on tak }}, load Rcmdr and then load and view some data. Rcmdr can be installed on desktop operating systems but it's not very straightforward (unlike most R packages). As a result, we show you also how to connect to R on tak (our big Linux computer). If you do want to try to install it on your own desktop computer, start R and use the commmand `install.packages("Rcmdr", dependencies=TRUE)` Start R. If you're on a Mac, also start X11 but then go back to R. Load the package by typing`library(Rcmdr)` If you get an error, Rcmdr may not be installed. Then read a file: Data >> Data in packages >> Read data set from an attached package Double click on "datasets". Double click on "ChickWeight". To get some information about this "Active data set", Data >> Active data set >> Help on active data set Then click on "View data set" to get a spreadsheet view. 2: Draw a boxplot. The chicks are in multiple groups, so it's somewhat tricky to view the data most appropriately. For a look at the distribution of the chick weights with a boxplot (divided into groups), try Graphs >> Boxplot >> Plot by groups >> click on "Diet" >> OK >> Select "weight" >> OK 3: Draw a scatterplot. For a look at all of their weight data with a scatterplot, try Graphs >> Scatterplot >> Plot by groups >> click on "Diet" >> Check "Plot lines by group" >> OK >> for x-variable select "Time" >> for y-variable select "weight" >> OK 4: Explore the R Commander. Note that all executed commands are printed in the Script Window. If commands (like Statistics >> Summaries >> Active data set) generate text output, it appears in the Output Window. Also notice that you can type a command (like ?boxplot) in the Script Window and click on "Submit" (on the right in the middle) to execute this last command.

### Perform pairwise comparisons, power calculations, and multiple hypothesis testing

 Task Using Excel Using R 1: Select (right click) and download data about [[ Energy expenditure by lean and obese women ]]. If that isn't available, you can copy the data from the bottom of page, paste into a text editor, and save the file as plain text. Open the tab-delimited text file energy.txt you should have downloaded (from the "Task" column at left) or paste the data from [[ energy.html ]] into Excel. After starting R, move R to the directory you want by using the pull-down menu: For Windows: File >> Change dir >> Browse >> [Browse to Desktop] >> OK For Mac: Misc >> Change dir >> Browse >> [Browse to Desktop] >> OK If you don't want to use the pull-down menu, you can move the data file into working directory ("wd"). To find the working directory, use the command `getwd()` Then you can change (set) the working directory to match the location of data with a command like `setwd("C:/Documents and Settings/student/Desktop")` or `setwd("/Users/student/Desktop")` assuming "student" is your laptop username. Then read the file by typing a command like `energy = read.delim("energy.txt", header=T)` (and hitting "enter"). To see the "energy" dataset, either type `energy` or `edit(energy)` With the "edit" view, if fields are changed, energy actually stays the same. To change the dataset (not now!) use a command like "fix(energy)". 2: Are the means of these data sets different? Use a t-test to help you answer the question. Since we're testing for inequality, we'll need to use the two-tailed version. Also, unless we know otherwise, we may not want to assume the variances for each data set are equal. In that case we should use the modification of the t test (Welch's) that doesn't assume equal variances. Use a formula like `=TTEST(A2:A14,B2:B10,2,3)` where the first two fields are the data sets we're comparing. (Look at TTEST in Excel help to get the syntax and an explanation.) The third field, "tails", (2) indicates the two-tailed test, and the fourth field (3) indicates that we're performing a "Two-sample unequal variance" test. Another choice is go to Tools >> Data Analysis After selecting the desired type of t-test (near the bottom of the list), input fields should be easy to figure out. The output for this version of the t-test has a lot more information than just the p-value. Use a command like `t.test(energy\$lean, energy\$obese)` To get the syntax, options, and defaults, type `?t.test` The help page for t.test shows that the default command uses a two-sided (two-tailed) test for samples with unequal variances. Note that the p-value should agree with the Excel value, but also look at the other output data provided: Other important information is the means of the two samples and the 95% confidence interval (CI) for the difference of the means. (If the CI includes 0, then the p-value will be greater than 0.05). If you wanted to access only some of this data, type something like ```lean.vs.obese = t.test(energy\$lean, energy\$obese) names(lean.vs.obese) ``` and then you'll get the available fields. To just get the p-value, for example, type `lean.vs.obese\$p.value` 3: For a t-test or another reason, how can you find out if the variances of two samples are equal? To help answer the question, use the F test, which tests if the ratio of the variances is 1. Use a formula like `=FTEST(A2:A12,B2:B12)` For the energy data, you should get a high p-value that doesn't show that the variances are different. Use a command like `var.test(energy\$lean, energy\$obese)` which performs the F test. 4: How can you find out how many replicates you need to get a desired statistical power for a test? Or given the number of replicates, how sure can you be about detecting small differences between means? You can enter the formulas for power into Excel, but there's no quick way to do the calculation. For power calculations for t-test, the command is `power.t.test` You enter all the required data (n (sample size), difference in means (delta), standard deviation, significance level (alpha), power) -- except one -- and the missing piece of data is calculated. Using the energy data, the standard deviation is about 1.3. Assuming your p-value cutoff is 0.05, the desired power is 0.9, and you have 9 replicates for each sample, how small a difference between means could you detect? ```power.t.test(n=9, delta=NULL, sd=1.3, sig.level=0.05, power=0.9)``` On the other hand, if you want to detect a difference of 1.5, how many replicates should you do? ```power.t.test(n=NULL, delta=1.5, sd=1.3, sig.level=0.05, power=0.9)``` Doing a series of these calculations (and knowing the expense/time of additional replicates) may help you more effectively design an experiment. 5: How can you compare means of two samples using a nonparametric test? This could be recommended if the distributions of the data are far from normal -- one of the requirements of a t-test. Like all nonparametric tests, the Wilcoxon signed rank test compares ranks rather than actual values. The Wilcoxon signed rank test (Mann Whitney test) is not available in Excel. Use a command like `wilcox.test(energy\$lean, energy\$obese)` You may get a warning if two values are identical (like 7.48 in the energy data), since it can be tricky to calculate with tied ranks. The warning indicates that the p-value generated by the test is approximate. 6: Select (right click) and download data about [[ energy intake ]] for women pre- and postmenstrually the same data at the bottom of the page. Are the means of these data sets different? These 11 sets of data were collected on the same 11 women, so this is a clear case when the paired t-test should be used. Open the file intake_paired.txt in Excel or paste the data from intake_paired.html into Excel. Use a formula like `=TTEST(A2:A12,B2:B12,2,1)` The fourth field (1) indicates that we're performing a paired test. If you were to try this as the usual unpaired t-test (with "3" for the fourth field), the p-value is very different (and wrong for these data). Read the table with a command like `intake = read.delim("intake_paired.txt", header=T)` Another way to try is open the file in Excel, select and copy the energy intake data, and then use the command `intake = read.delim(file("clipboard"), header=T) ` You might want to check that you got it right with `edit(intake)` To perform the test, use a command like `t.test(intake\$pre, intake\$post, paired=TRUE)` 7: Correct for multiple hypothesis testing by FWER (the Bonferroni correction). Start by downloading [[ expression data ]], showing expression levels for ten genes in 6 replicates each of tissues "A" and "B" (or get it below) Perform a t-test comparing levels across tissues, and then correct the p-values. Open the file expression.txt in Excel or paste the data from expression.html into Excel. Create a new column on the right called "p-value", and get p-values from a t-test using a command like `=TTEST(B2:G2,H2:M2,2,3)` Then copy the formula down the column. Count the number of genes tested in this experment (rows in the table). Multiply each p-value by this number, but don't go higher than 1. This can be done with a formula like `=MIN(N2*10,1)` where cell N2 contains the raw p-value. Copy this formula down the column. Read the table with a command like `exp = read.delim("expression.txt", h=T)` One way to do t-tests for each row and then correct the p-values is like this: ```# Get the number of rows (genes) num.genes = nrow(exp) # Make a matrix of "num.genes" rows and 2 columns, # filled with 0's p.vals = matrix(0, num.genes, 2) # Name the columns colnames(p.vals) = c("raw.p.value", "Bonferroni.p.values") # Count from 1 to "num.genes" for (i in 1:num.genes) { # Calculate the p-value from the t-test p.vals[i,1] = t.test(exp[i,2:7], exp[i,8:13])\$p.value # Multiply p-value by "num.genes" p.vals[i,2] = min (p.vals[i,1] * num.genes, 1) } # Add new columns to expression table exp.p = cbind(exp, p.vals)``` Okay, so some things are easier to do in Excel. To see the output, remember that you can use a command like `edit(exp.p)` or print the output to a file: ```write.table(exp.p, file = "pvals_Bonferroni.txt", sep="\t", quote=FALSE)``` 8: Correct for multiple hypothesis testing by FDR (the False Discovery Rate) This can now be done with a {{ web tool }} too. Can be done in Excel, but R is actually easier. If you really want to, sort p-values in increasing order. The last p-value in the list stays the same. The p-value one row above is calculated with a formula like ```=MIN(N10*(10/RANK(N10,\$N\$2: \$N\$11,1)),O11)``` where N10 contains the current raw p-value, N2 through N11 contain all the p-values, and O11 is the previous (one row below) corrected p-value. Copy the formula up the column. Using the previous table you created, grab the column with the raw p-values: `raw.pvals = exp.p\$raw.p.value` With one command, you can do a multiple hypothesis correction: `p.adjust(raw.pvals, method)` where the 'method' is one of six types: "fdr", "holm", "hochberg", "hommel", "bonferroni", "BH", and "BY". To get more information about these methods, get command documentation by typing `?p.adjust` We'll be using FDR (using the method of Benjamini and Hochberg), so type `FDR.p.values = p.adjust(raw.pvals, "fdr")` And now, merge these new columns with your previous table. `data.adjp = cbind(exp.p, FDR.p.values)` The FDR corrections are in the last column. Note how different the p-values are between the raw values, the Bonferroni-corrected values, and the FDR-corrected values. And finally, print all the data to a file: ```write.table(data.adjp, file = "pvals_corrected_all.txt", sep="\t", quote=FALSE)``` 9: Exit from applications You may wish to save your file and/or email it to yourself for reference. To get all of your work (input and text output) look for the "Save to file" choice on the menu. You can save the "history" output (during or at the end of your work) with a command like `savehistory(file = "R_history_class2.R")` To exit from R, use the pull-down menu or type `q()` for "quit".
 Dataset 1 for exercises above: energy expenditure in groups of lean and obese women (tab-delimited text) ```lean obese 7.53 9.21 7.48 11.51 8.08 12.79 8.09 11.85 10.15 9.97 8.4 8.79 10.88 9.69 6.13 9.68 7.9 9.19 7.05 7.48 7.58 8.11```
 Dataset 2 for exercises above: paired energy intake for women pre- and postmenstrually (tab-delimited text) ```pre post 5260 3910 5470 4220 5640 3885 6180 5160 6390 5645 6515 4680 6805 5265 7515 5975 7515 6790 8230 6900 8770 7335```
 Dataset 3 for exercises above: expression data for 10 genes across two tissue types (tab-delimited text) ```Gene A1 A2 A3 A4 A5 A6 B1 B2 B3 B4 B5 B6 Gene1 353 379 524 383 278 291 648 242 779 1307 567 490 Gene2 159 60 156 97 28 80 90 168 124 133 182 182 Gene3 1861 1603 2460 1486 1526 939 2259 1844 2345 1944 1225 2370 Gene4 171 80 397 75 44 113 286 464 525 460 258 429 Gene5 447 364 627 221 263 132 81 163 286 135 111 73 Gene6 3767 2810 2206 3723 4071 1379 624 680 1033 1687 1367 2194 Gene7 1915 3120 4072 2120 325 1210 1411 1007 2080 830 500 616 Gene8 20 59 213 20 35 286 20 20 195 406 366 1238 Gene9 835 935 1665 764 1323 1030 3237 1125 4655 3807 3593 2701 Gene10 516 702 423 388 569 60 269 339 188 94 171 139 ```