WIBR Bioinformatics and Research Computing

Statistics for Biologists (November 2007)

URL: http://iona.wi.mit.edu/bio/education/stats2007/

Exercises for lecture 1

Goals for class 1 exercises: Use descriptive statistics and figures to describe data with Excel and R. See how a series of commands can be used to create an R pipeline in a script file. Note that the bold terms are the commands for Excel or R. You'll start the exercises by reading data from a text file.

Links to data for download are in double square brackets [[ like this ]]. Links to other pages are in double curly brackets {{ like this }}.

Start R on your desktop or laptop by clicking on the icon. Alternatively, you may also connect to R on tak

TaskUsing ExcelUsing R
1: Select (right click) and download data about [[ Energy expenditure by lean and obese women ]] to your desktop.

If that isn't available, you can copy the data from the bottom of page, paste into a text editor, and save the file as plain text on your desktop.

Open the tab-delimited text file energy.txt you should have downloaded (from the "Task" column at left) or paste the data from [[ energy.html ]] into Excel. After starting R, move R to the directory you want by using the pull-down menu:

For Windows: File >> Change dir >> Browse >> [Browse to Desktop] >> OK

For Mac: Misc >> Change dir >> Browse >> [Browse to Desktop] >> OK

If you don't want to use the pull-down menu, you can move the data file into working directory ("wd"). To find the working directory, use the command

Then you can change (set) the working directory to match the location of data with a command like

setwd("C:/Documents and Settings/student/Desktop")
assuming "student" is your laptop username.

Then read the file by typing a command like

energy = read.delim("energy.txt", header=T)
(and hitting "enter").

To see the "energy" dataset, either type

With the "edit" view, if fields are changed, energy actually stays the same. To change the dataset (not now!) use a command like "fix(energy)".
2: Get the mean and median of each column of data Use formulas like
Use commands like
which are different ways of describing the "lean" column, which is in column 1.

If you have any "NA" values (type

to see), you may need
mean(energy$lean, na.rm=T)
which removes the "NA" fields. Then similarly get the median with
or simply
3: Get the standard deviation of each column of data Use a formula like
Use a command like
4: View the distribution of data with a histogram and a q-q plot. Go to pull-down menu: Tools >> Data Analysis.

If this option is not present, you need to add or install the Analysis ToolPak. Go to Tools >> Add-Ins and check the box.

Before going further, create a "bin range" in another column of your Excel file. Perusing your data, decide the size of the bins you'd like to represent on the histogram. For the energy data, one possible choice is 6 through 11 in intervals of 0.5 units. So make a column containing 6, 6.5, 7, ..., 11.

Select Histogram, enter or select the Input Range and the Bin Range, check "Chart Output", and click "OK".

Q-Q plots aren't very straightforward in Excel.

Use a command like
hist(energy$lean, breaks=20)
where 'breaks' describes how many bins to create.

to see all the options. You can get help on any command by typing a question mark before the command.

To change the color of the bars, for example, type

hist(energy$lean, breaks=20, col="green")
To see how normal the data are, try a q-q plot:
qqnorm(energy$lean, pch=19, cex=2)
qqline(energy$lean, lwd=2, col="blue")
Optional arguments used:
  • "pch=19" => draw solid circles
  • "cex=2" => make the points twice as large as usual
  • "lwd=2" => make lines twice as wide as usual

    Points that are normally distributed appear along the line, which passes through the first and third quartiles (25th and 75th percentiles).

  • 5: Draw a scatterplot showing all the data One method is to reformat to the data, with group number (1 or 2) in the first column and the data for that group in the second column. This should look like [[ energy_by_group.html ]]. Click on the Chart Wizard, select 'XY (Scatter)' and click on "Finish". Then you may wish to clear the x-axis and add the two group names as text boxes. Use a command like
    stripchart(energy, vertical=T, method="jitter")
    where 'jitter' does that to the points so they don't overlap so much.

    to see all the options.
    5: Draw a boxplot Boxplots are not possible in Excel -- unless you want to use lots of hacks. Use a command like
    To add some more options, type
    boxplot(energy, col=c("red", "blue"),
    main="This is a boxplot in R",
    xlab="Exp groups", ylab="Units go here")
    6: Run a script in R Given the script below, there are at least three ways to run it:

    1: Save [[ this file ]] (also shown below in the colored box) into your "working directory" [run getwd() to find out], calling it something like script1.R     The ".R" at the end isn't necessary but helpful to label it as an R script. Then, within R, type


    2: Copy and paste the commands into your R terminal. This can be done one line at a time or all at once

    3: (For tak and other Unix systems): From the Unix prompt [without logging in to R already], type

    ./script1.R > script1.R.out.txt
    Any output will be saved in "script1.R.out.txt". If you get a "Permission denied" error, you need to make the script "executable" with a command like
    chmod +x script1.R
    Then you can run the previous command. This is a quick way to run a script you know already works, but it's not a good way to debug and modify the commands.

    # Read built-in data set "women"
    # Get names of each column in the "women" matrix
    # Get the dimensions of the "women" matrix
    # Calculate the BMI for these women
    # and round the results to one decimal place
    BMI = round((women$weight/2.2)/((women$height/39.37)**2),1)
    # plot the data as points, saving the figure in a file
    pdf("women_weight_vs_height_1.pdf", width=11, height=8.5)
    plot(women$weight, women$height)
    # Stop saving to the file
    # Add some more options to the figure and print it to another file.
    # This command is long so it's split into multiple rows
    # col => color of points
    # pch=19 => draw filled circles for points; type '?points' for all choices
    # cex=2 => draw points twice as big as usual
    pdf("women_weight_vs_height_2.pdf", width=11, height=8.5)
    plot(women$weight, women$height,
    main = "Data for American women aged 30-39",
    xlab = "Height (in)", ylab = "Weight (lb)",
    col="red", pch=19, cex=2)
    # Add the BMI data to the 'women' table by adding a column ('cbind')
    women.incl.BMI = cbind(women, BMI)
    # Print a tab-delimited file into the working directory
    # Don't put quotes around fields
    write.table(women.incl.BMI, "Women.BMI.data.txt", sep="\t", quote=FALSE)

    After running the script, you should find three files (two PDF files and a table) in the working directory. If you're on tak, after exiting from R (see next step), you can view the PDF files using the Unix command

    /nfs/BaRC/apps/adobe/bin/acroread myfile.pdf
    where 'acroread' is the Unix command to open Acrobat Reader.
    7: Exit from applications You may wish to save your file and/or email it to yourself for reference. To get all of your work (input and text output) look for the "Save to file" choice on the menu.

    If you wish to see your previous commands, type

    or see the history using the history button (Mac only).

    You can also save the "history" output (during or at the end of your work) with a command like

    savehistory(file = "R_history_class1.R")

    To exit from R, use the pull-down menu or type

    for "quit".

    When you exit R, you'll be asked

    "Save workspace image?"

    If you answer "yes", when you re-start R, the same commands you performed this session will be performed -- except packages (like you'll use later) will need to be loaded.

    Data for exercises above: energy expenditure in groups of lean and obese women (tab-delimited text)
    lean	obese
    7.53	9.21
    7.48	11.51
    8.08	12.79
    8.09	11.85
    10.15	9.97
    8.4	8.79
    10.88	9.69
    6.13	9.68
    7.9	9.19

    Questions or comments?   wibr-bioinformatics@wi.mit.edu
    Bioinformatics and Research Computing at Whitehead Institute