Introduction to R

Hands-on

This document accompanies the slides from BaRC’s Introduction to R and shows the use of some simple commands. See the accompanying slides and data files (and material from other Hot Topics talks) at http://barc.wi.mit.edu/hot_topics/ (when you click on “Statistics”). All of the commands below should work similarly if run on Windows/Mac/Linux (using the typical R installation or RStudio).

Our example data is the number of tumors in wild-type and knockout mice, each assayed in triplicate. For variable names, we use one-word (no spaces, anyway) names that can include any combination of lowercase and uppercase letters and numbers (although they must begin with a letter). It can be helpful to include dots to make the variables easier to read, like MGF.ko.trial.1 R will convert many other special characters (even dashes and underscores) into dots.

After starting your R session, save this file in your working directory.

# login to https://rstudio.wi.mit.edu
#setwd("/nfs/BaRC_training")
#dir.create("username")  Use your username (eg. jdoe)
# File -> Open File... -> "Intro_R.Rmd"
#setwd("/nfs/BaRC_training/userName")
#use "Save As..." to save the Rmd to your directory/folder

##Part I: Getting familiar with R

###1 Entering data by hand, The c() (combine) command is used to concatenate several pieces of data into a vector. These can be numbers or text (but the latter must be surrounded by quotes).

# These are the tumor counts for the WT animals.
wt = c(5, 6, 7)
# These are the tumor counts for the KO animals.
ko = c(8, 9, 11)
wt

## [1] 5 6 7

ko

## [1]  8  9 11

Note that typing the variable by itself prints the values. The [1] at the beginning of each output row shows that the first value in the row is the first entry in the vector (list). This becomes useful if we have a vector containing more values that can be printed on one line.

We’ve also included some comments. Whenever R sees a pound sign #, it ignores the rest of the line. A liberal use of comments is very helpful to remind you or show others what each command does and perhaps why you chose the specific options you did.

###2 Running a statistical test

t.test(wt, ko)

## 
##  Welch Two Sample t-test
## 
## data:  wt and ko
## t = -3.1623, df = 3.4483, p-value = 0.04191
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.4542173 -0.2124494
## sample estimates:
## mean of x mean of y 
##  6.000000  9.333333

Unlike Excel, the output for a statistical test provides more information than just the p-value. But what if we don’t want to use the default t-test options. How can we find out how to change the default settings? Try entering the command, preceded by a question mark, as in ?t.test

Running this command on a Windows or Mac computer will usually open a documentation page in your web browser, which can be easier to read than the text-only Unix way of printing documentation.

In RStudio, you can type t.test in the search box under “Help”.

We can also save the results of a statistical analysis as another variable which will then contain all the information that you saw printed to the screen (and sometimes a lot more). For example, let’s run a standard t-test (note the var.equal=T, which calculates separate variances for each variable).

wt.vs.ko = t.test(wt, ko, var.equal=T)
wt.vs.ko

## 
##  Two Sample t-test
## 
## data:  wt and ko
## t = -3.1623, df = 4, p-value = 0.03411
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.2599634 -0.4067032
## sample estimates:
## mean of x mean of y 
##  6.000000  9.333333

If we want to print or save just part of this output, we need to know how to access its different parts. Try the names() command to see what each part is called.

names(wt.vs.ko)

## [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
## [6] "null.value"  "alternative" "method"      "data.name"

Now we can access each part of this variable using the syntax VARIABLE$PART, such as

wt.vs.ko$p.value

## [1] 0.03410942

wt.vs.ko$conf.int

## [1] -6.2599634 -0.4067032
## attr(,"conf.level")
## [1] 0.95

What if we want to get the commands we used, to put them into a script to document our analyis pipeline? Use the “history” command, as in history(max.show=Inf). In RStudio, you simply click on “History”.

###3 Reading data from a file To get started you probably want to get R and your data in the same place. To get R’s working directory, try

getwd()

## [1] "/nfs/BaRC_training"

To change the working directory on your computer, go to File => Change dir… menu. On tak (or you can do it this way on your computer, too), use the setwd() command, like setwd(“/lab/solexa_public/Graceland_Lab/Elvis”) We can see what files we have in our directory. (You won’t have all of these.)

dir()

##  [1] "aflamier"                     "byuan"                       
##  [3] "david_pincus"                 "FILES_WILL_BE_DELETED_WEEKLY"
##  [5] "FOR_BARC_TRAINING_ONLY"       "Intro_R.Rmd"                 
##  [7] "Intro_to_Unix"                "lbechen"                     
##  [9] "rgt"                          "UnixII"

Let’s assume that we have a tab-delimited text file of our dataset (“tumors wt ko.txt”), with one column for wt and another for ko, with column labels in the first row. One way to read the file is with the read.delim() command, Note that we’ve included header=T to indicate that the first row has our column names. If we use header=F then the columns will just be numbered (and the data will include the first row of the file).

tumors = read.delim("http://barc.wi.mit.edu/education/hot_topics/Intro_to_R_2015/tumors_wt_ko.txt", header=T)
tumors

##   wt ko
## 1  5  8
## 2  6  9
## 3  7 11

Our data looks OK: we have one (labeled) column for each group of mice.

###4 Accessing a data matrix How do we access subsets of tumors? We can use column names or row and column numbers.

tumors$wt

## [1] 5 6 7

tumors$ko

## [1]  8  9 11

tumors[1:3,1]

## [1] 5 6 7

tumors[1:2,1:2]

##   wt ko
## 1  5  8
## 2  6  9

tumors[,1]

## [1] 5 6 7

t.test(tumors$wt, tumors$ko)

## 
##  Welch Two Sample t-test
## 
## data:  tumors$wt and tumors$ko
## t = -3.1623, df = 3.4483, p-value = 0.04191
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.4542173 -0.2124494
## sample estimates:
## mean of x mean of y 
##  6.000000  9.333333

###5 Creating an output table Unless we have just a little data, we may want to run several analyses and create a le to hold the output. One way to do that is to create a table the hold the output data. In this case, let’s say we want to try two variations of two statistical tests. We should note that normally we’d decide on a statistical test when designing an experiment, rather than trying a bunch of them after we already have the data. Let’s start by creating an empty matrix of two rows and two columns, labeled to show what we want each cell to hold.

pvals.out = matrix(data=NA, ncol=2, nrow=2)
colnames(pvals.out) = c("two.tail", "one.tail")
rownames(pvals.out) = c("Welch", "Wilcoxon")
pvals.out

##          two.tail one.tail
## Welch          NA       NA
## Wilcoxon       NA       NA

Now let’s run some statistical tests and put the output p-values into our table. We’ll start with Welch’s test for the first row, once as a two-sided test and once as a one-sided test.

pvals.out[1,1] = t.test(tumors$wt, tumors$ko)$p.value
pvals.out[1,2] = t.test(tumors$wt, tumors$ko, alt="less")$p.value

Now let’s try the Wilcoxon rank sum test, a non-parametric alternative to the t-test.

pvals.out[2,1] = wilcox.test(tumors$wt, tumors$ko)$p.value
pvals.out[2,2] = wilcox.test(tumors$wt, tumors$ko, alt="less")$p.value
pvals.out

##            two.tail   one.tail
## Welch    0.04191452 0.02095726
## Wilcoxon 0.10000000 0.05000000

Since we have row names, we can also use them to access subsets of our matrix

pvals.out["Welch",]

##   two.tail   one.tail 
## 0.04191452 0.02095726

###6 Printing an output table Our table looks fine, except for the fact that we have so many digits that some are meaningless. We could either round them now or later. If we choose to do it now, set the number of significant figures:

pvals.out.rounded = signif(pvals.out, 3)
pvals.out.rounded

##          two.tail one.tail
## Welch      0.0419    0.021
## Wilcoxon   0.1000    0.050

Now let’s print the table.

write.table(pvals.out.rounded, file="tumor_pvals.txt", sep="\t", quote=F)

We’ve included our output filename, that we want the file to be tab-delimited (sep=“”), and that we don’t want quotes around any text (quote=F). Now we can open “Tumor_pvals.txt” in any text editor or Excel. If we do so in Excel, however, our column labels will need to be shifted over one column. To avoid this, we could have created a 3-column table, with a first column containing “Welch” and “Wilcoxon” and then print the table with the additional option row.names=F.

###7 Creating figures R is very powerful and very exible with its figure generation. Besides statistics, graphics are another great reason to learn R. We can start by creating a simple boxplot of our data.

boxplot(tumors)

Normally if we only had three points of data, a scatterplot would be more informative than a boxplot. Nevertheless, let’s add some more details the make the figure more readable.

boxplot(tumors, col=c("gray", "red"), main="MFG appears to be a tumor suppressor", ylab="number of tumors")

If we run R on our computer, a figure will pop up and can then be saved by right-clicking on it. On tak, a figure will end up in a file called “Rplots.pdf”.

Normally it’s more useful to include filenames in our code, as this makes it more reproducible and more automated. To do this, we need to first issue a command telling R what graphical format to use. Then we write our plotting command(s). A PDF file can hold multiple figures, by default one per page. Finally, we need to “close” the file when we’re finished with it.

#Create a PDF file 11 inches wide and 8.5 inches high
pdf("tumor_boxplot.pdf", w=11, h=8.5)
boxplot(tumors)
dev.off()

## png 
##   2

Another common format is PNG. The syntax is similar, except the figure dimensions are in pixels rather than inches.

png("tumor_boxplot.png", w=1800, h=1200)
boxplot(tumors)
dev.off()

## png 
##   2

###8 Accessing very low p-values R has an unexpected habit of rounding very low p-values to “< 2.2e-16”, but we can do better than that approximation (if desired). Let’s start by creating some very extreme values.

a = 1:10
a

##  [1]  1  2  3  4  5  6  7  8  9 10

b = a + 1000
b

##  [1] 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010

Now let’s run Welch’s two sample t-test. Then let’s run the same test but explicitly ask for the p-value.

t.test(a,b)

## 
##  Welch Two Sample t-test
## 
## data:  a and b
## t = -738.55, df = 18, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1002.8447  -997.1553
## sample estimates:
## mean of x mean of y 
##       5.5    1005.5

t.test(a,b)$p.value

## [1] 8.605537e-42

Note that we get a much more precise p-value using the second commmand. We may want to round the value with a modified command like,

signif(t.test(a,b)$p.value, 1)

## [1] 9e-42

###9 Ending a session Let’s save all of our commands. We may want to edit out the not-so-good ones later, but at least we have them all.

#savehistory(file="R_intro_Hot_Topics_commands.R")
#Note: savehistory will not work when knit'd

What variables do we have so far?

ls()

## [1] "a"                 "b"                 "ko"               
## [4] "pvals.out"         "pvals.out.rounded" "tumors"           
## [7] "wt"                "wt.vs.ko"

And what files have we created?

dir(pattern="tumor*")

## [1] "tumor_boxplot.pdf" "tumor_boxplot.png" "tumor_pvals.txt"

In the interests of totally reproducible research, we’ll end by printing our R environment.

date()

## [1] "Mon Oct 15 20:12:43 2018"

print(sessionInfo(), locale=FALSE)

## R version 3.4.4 (2018-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.5 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.0
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.4.4  backports_1.1.2 magrittr_1.5    rprojroot_1.3-2
##  [5] tools_3.4.4     htmltools_0.3.6 yaml_2.1.19     Rcpp_0.12.18   
##  [9] stringi_1.2.4   rmarkdown_1.10  knitr_1.20      stringr_1.3.1  
## [13] digest_0.6.17   evaluate_0.10.1

This is just a start! There’s lots more….

To exit R, type q()

If you save the workspace image, you can start R again in about the same state as you are now. Then when you re-start R in the same directory, files called .RData and .Rhistory will reload of your data and command history.

Part II: tidyverse (optional)

Tidyverse is suite of packages, written primarily by Hadley Wickham, for data analysis. We’ll use three packages: readr, tidyr and dplyr, which makes accessing/analyzing data more easier than base R.

We’ll first load the packages,

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readr)
library(tidyr)

Read in the data file,

data<-read_tsv("http://barc.wi.mit.edu/education/hot_topics/Intro_to_R_2016/normalizedCounts_subset.txt")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Gene = col_character()
## )

## See spec(...) for full column specifications.

head(data)

## # A tibble: 6 x 31
##   Gene    HMLE_1 HMLE_2 HMLE_3  N8_1  N8_2  N8_3 N8_lo_1 N8_lo_2 N8_lo_3
##   <chr>    <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl>   <dbl>
## 1 C1orf43  2639.  2637   2140. 1974. 2325. 2721.  12383.  11561.  12654.
## 2 CHMP2A    396.   375.   308.  386.  390.  401.   3746.   3334.   3474.
## 3 EMC7      627.   659.   498.  670.  939.  624.   3926.   3887.   4368.
## 4 GPI      5739.  5868.  7034. 5778. 4972. 5461.  24078.  18815.  21059.
## 5 PSMB2    2770.  3052.  2986. 2989. 2574. 2113.  10581.  10733.  11364.
## 6 PSMB4    2844.  3279.  2797. 3216. 3303. 2677.  26545.  23818.  27669.
## # ... with 21 more variables: N8_lo_4 <dbl>, N8_lo_5 <dbl>, N8_lo_6 <dbl>,
## #   N8_hi_1 <dbl>, N8_hi_2 <dbl>, N8_hi_3 <dbl>, N8_hi_4 <dbl>,
## #   N8_hi_5 <dbl>, N8_hi_6 <dbl>, `2020_159_hi` <dbl>,
## #   `2021_159_lo` <dbl>, `2051_159_hi` <dbl>, `2052_159_lo` <dbl>,
## #   `2064_159_lo` <dbl>, `2065_159_hi` <dbl>, `2479_231_lo` <dbl>,
## #   `2480_231_hi` <dbl>, `2499_231_lo` <dbl>, `2500_231_hi` <dbl>,
## #   `2513_231_hi` <dbl>, `2514_231_lo` <dbl>

We’ll now access/query certain columns/rows using dplyr and demonstrate its functionality,

#columns 1 to 4; note the "%>%" is equivalent to the Unix pipe command where the output of one is passed as input to another method
data_HMLE<- select(data, 1:4)
data_HMLE <- data %>% select(1:4)
head(data_HMLE)

## # A tibble: 6 x 4
##   Gene    HMLE_1 HMLE_2 HMLE_3
##   <chr>    <dbl>  <dbl>  <dbl>
## 1 C1orf43  2639.  2637   2140.
## 2 CHMP2A    396.   375.   308.
## 3 EMC7      627.   659.   498.
## 4 GPI      5739.  5868.  7034.
## 5 PSMB2    2770.  3052.  2986.
## 6 PSMB4    2844.  3279.  2797.

#only column "Gene"
data_rowNames <- select(data,Gene)
head(data_rowNames)

## # A tibble: 6 x 1
##   Gene   
##   <chr>  
## 1 C1orf43
## 2 CHMP2A 
## 3 EMC7   
## 4 GPI    
## 5 PSMB2  
## 6 PSMB4

#rename Column Gene to Genes
data_rowNames <- select(data,Genes=Gene)
head(data_rowNames)

## # A tibble: 6 x 1
##   Genes  
##   <chr>  
## 1 C1orf43
## 2 CHMP2A 
## 3 EMC7   
## 4 GPI    
## 5 PSMB2  
## 6 PSMB4

#Columns Gene to N8_3 (by column name)
data_range <- select(data, Gene:N8_3)
head(data_range)

## # A tibble: 6 x 7
##   Gene    HMLE_1 HMLE_2 HMLE_3  N8_1  N8_2  N8_3
##   <chr>    <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 C1orf43  2639.  2637   2140. 1974. 2325. 2721.
## 2 CHMP2A    396.   375.   308.  386.  390.  401.
## 3 EMC7      627.   659.   498.  670.  939.  624.
## 4 GPI      5739.  5868.  7034. 5778. 4972. 5461.
## 5 PSMB2    2770.  3052.  2986. 2989. 2574. 2113.
## 6 PSMB4    2844.  3279.  2797. 3216. 3303. 2677.

#All columns except Gene
data_noGene <- select(data,-Gene)
head(data_noGene)

## # A tibble: 6 x 30
##   HMLE_1 HMLE_2 HMLE_3  N8_1  N8_2  N8_3 N8_lo_1 N8_lo_2 N8_lo_3 N8_lo_4
##    <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1  2639.  2637   2140. 1974. 2325. 2721.  12383.  11561.  12654.  12025.
## 2   396.   375.   308.  386.  390.  401.   3746.   3334.   3474.   3485.
## 3   627.   659.   498.  670.  939.  624.   3926.   3887.   4368.   3674.
## 4  5739.  5868.  7034. 5778. 4972. 5461.  24078.  18815.  21059.  23859.
## 5  2770.  3052.  2986. 2989. 2574. 2113.  10581.  10733.  11364.  10265.
## 6  2844.  3279.  2797. 3216. 3303. 2677.  26545.  23818.  27669.  26145.
## # ... with 20 more variables: N8_lo_5 <dbl>, N8_lo_6 <dbl>, N8_hi_1 <dbl>,
## #   N8_hi_2 <dbl>, N8_hi_3 <dbl>, N8_hi_4 <dbl>, N8_hi_5 <dbl>,
## #   N8_hi_6 <dbl>, `2020_159_hi` <dbl>, `2021_159_lo` <dbl>,
## #   `2051_159_hi` <dbl>, `2052_159_lo` <dbl>, `2064_159_lo` <dbl>,
## #   `2065_159_hi` <dbl>, `2479_231_lo` <dbl>, `2480_231_hi` <dbl>,
## #   `2499_231_lo` <dbl>, `2500_231_hi` <dbl>, `2513_231_hi` <dbl>,
## #   `2514_231_lo` <dbl>

#select columns with "lo"
#?contains
data_lo <- select(data, contains("lo"))
head(data_lo)

## # A tibble: 6 x 12
##   N8_lo_1 N8_lo_2 N8_lo_3 N8_lo_4 N8_lo_5 N8_lo_6 `2021_159_lo`
##     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>         <dbl>
## 1  12383.  11561.  12654.  12025.  11570.  12584.         4041.
## 2   3746.   3334.   3474.   3485.   3196.   3505.         1114.
## 3   3926.   3887.   4368.   3674.   3862.   4361.         1269.
## 4  24078.  18815.  21059.  23859.  18633.  21224.         7659.
## 5  10581.  10733.  11364.  10265.  10857.  11163.         3630.
## 6  26545.  23818.  27669.  26145.  23917.  27725.         3967.
## # ... with 5 more variables: `2052_159_lo` <dbl>, `2064_159_lo` <dbl>,
## #   `2479_231_lo` <dbl>, `2499_231_lo` <dbl>, `2514_231_lo` <dbl>

data_lo <- select(data, matches('lo|Gene'))
head(data_lo)

## # A tibble: 6 x 13
##   Gene    N8_lo_1 N8_lo_2 N8_lo_3 N8_lo_4 N8_lo_5 N8_lo_6 `2021_159_lo`
##   <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>         <dbl>
## 1 C1orf43  12383.  11561.  12654.  12025.  11570.  12584.         4041.
## 2 CHMP2A    3746.   3334.   3474.   3485.   3196.   3505.         1114.
## 3 EMC7      3926.   3887.   4368.   3674.   3862.   4361.         1269.
## 4 GPI      24078.  18815.  21059.  23859.  18633.  21224.         7659.
## 5 PSMB2    10581.  10733.  11364.  10265.  10857.  11163.         3630.
## 6 PSMB4    26545.  23818.  27669.  26145.  23917.  27725.         3967.
## # ... with 5 more variables: `2052_159_lo` <dbl>, `2064_159_lo` <dbl>,
## #   `2479_231_lo` <dbl>, `2499_231_lo` <dbl>, `2514_231_lo` <dbl>

# add a column with average gene expression
data_lo_mean <- select(data_lo,-Gene) %>% mutate(average=rowMeans(.))
data_lo_mean

## # A tibble: 11 x 13
##    N8_lo_1 N8_lo_2 N8_lo_3 N8_lo_4 N8_lo_5 N8_lo_6 `2021_159_lo`
##      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>         <dbl>
##  1  12383.  11561.  12654.  12025.  11570.  12584.         4041.
##  2   3746.   3334.   3474.   3485.   3196.   3505.         1114.
##  3   3926.   3887.   4368.   3674.   3862.   4361.         1269.
##  4  24078.  18815.  21059.  23859.  18633.  21224.         7659.
##  5  10581.  10733.  11364.  10265.  10857.  11163.         3630.
##  6  26545.  23818.  27669.  26145.  23917.  27725.         3967.
##  7   4207.   4423.   4262.   4037.   4545.   4282.         5842.
##  8   5427.   5190.   5698.   5552.   5079.   5800.         2484.
##  9   1280.   1442.   1244.   1262.   1430.   1189.         1995.
## 10  20143.  19895.  21588.  20532.  19714.  21545.         9019.
## 11   1896.   2487.   2181.   1913.   2501.   2215.         1320.
## # ... with 6 more variables: `2052_159_lo` <dbl>, `2064_159_lo` <dbl>,
## #   `2479_231_lo` <dbl>, `2499_231_lo` <dbl>, `2514_231_lo` <dbl>,
## #   average <dbl>

#select the first 5 rows
data_slice<-slice(data,1:5)
head(data_slice)

## # A tibble: 5 x 31
##   Gene    HMLE_1 HMLE_2 HMLE_3  N8_1  N8_2  N8_3 N8_lo_1 N8_lo_2 N8_lo_3
##   <chr>    <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl>   <dbl>
## 1 C1orf43  2639.  2637   2140. 1974. 2325. 2721.  12383.  11561.  12654.
## 2 CHMP2A    396.   375.   308.  386.  390.  401.   3746.   3334.   3474.
## 3 EMC7      627.   659.   498.  670.  939.  624.   3926.   3887.   4368.
## 4 GPI      5739.  5868.  7034. 5778. 4972. 5461.  24078.  18815.  21059.
## 5 PSMB2    2770.  3052.  2986. 2989. 2574. 2113.  10581.  10733.  11364.
## # ... with 21 more variables: N8_lo_4 <dbl>, N8_lo_5 <dbl>, N8_lo_6 <dbl>,
## #   N8_hi_1 <dbl>, N8_hi_2 <dbl>, N8_hi_3 <dbl>, N8_hi_4 <dbl>,
## #   N8_hi_5 <dbl>, N8_hi_6 <dbl>, `2020_159_hi` <dbl>,
## #   `2021_159_lo` <dbl>, `2051_159_hi` <dbl>, `2052_159_lo` <dbl>,
## #   `2064_159_lo` <dbl>, `2065_159_hi` <dbl>, `2479_231_lo` <dbl>,
## #   `2480_231_hi` <dbl>, `2499_231_lo` <dbl>, `2500_231_hi` <dbl>,
## #   `2513_231_hi` <dbl>, `2514_231_lo` <dbl>

#select certain genes
filter(data,Gene %in% c("EMC7","GPI"))

## # A tibble: 2 x 31
##   Gene  HMLE_1 HMLE_2 HMLE_3  N8_1  N8_2  N8_3 N8_lo_1 N8_lo_2 N8_lo_3
##   <chr>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl>   <dbl>
## 1 EMC7    627.   659.   498.  670.  939.  624.   3926.   3887.   4368.
## 2 GPI    5739.  5868.  7034. 5778. 4972. 5461.  24078.  18815.  21059.
## # ... with 21 more variables: N8_lo_4 <dbl>, N8_lo_5 <dbl>, N8_lo_6 <dbl>,
## #   N8_hi_1 <dbl>, N8_hi_2 <dbl>, N8_hi_3 <dbl>, N8_hi_4 <dbl>,
## #   N8_hi_5 <dbl>, N8_hi_6 <dbl>, `2020_159_hi` <dbl>,
## #   `2021_159_lo` <dbl>, `2051_159_hi` <dbl>, `2052_159_lo` <dbl>,
## #   `2064_159_lo` <dbl>, `2065_159_hi` <dbl>, `2479_231_lo` <dbl>,
## #   `2480_231_hi` <dbl>, `2499_231_lo` <dbl>, `2500_231_hi` <dbl>,
## #   `2513_231_hi` <dbl>, `2514_231_lo` <dbl>

dplyr has function group_by and summarise which can be very helpful in grouping/summarising data

# Gather data to make it 'tidy';
data2<-data %>% gather(Sample, Value, HMLE_1:`2514_231_lo`)
head(data)

## # A tibble: 6 x 31
##   Gene    HMLE_1 HMLE_2 HMLE_3  N8_1  N8_2  N8_3 N8_lo_1 N8_lo_2 N8_lo_3
##   <chr>    <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl>   <dbl>
## 1 C1orf43  2639.  2637   2140. 1974. 2325. 2721.  12383.  11561.  12654.
## 2 CHMP2A    396.   375.   308.  386.  390.  401.   3746.   3334.   3474.
## 3 EMC7      627.   659.   498.  670.  939.  624.   3926.   3887.   4368.
## 4 GPI      5739.  5868.  7034. 5778. 4972. 5461.  24078.  18815.  21059.
## 5 PSMB2    2770.  3052.  2986. 2989. 2574. 2113.  10581.  10733.  11364.
## 6 PSMB4    2844.  3279.  2797. 3216. 3303. 2677.  26545.  23818.  27669.
## # ... with 21 more variables: N8_lo_4 <dbl>, N8_lo_5 <dbl>, N8_lo_6 <dbl>,
## #   N8_hi_1 <dbl>, N8_hi_2 <dbl>, N8_hi_3 <dbl>, N8_hi_4 <dbl>,
## #   N8_hi_5 <dbl>, N8_hi_6 <dbl>, `2020_159_hi` <dbl>,
## #   `2021_159_lo` <dbl>, `2051_159_hi` <dbl>, `2052_159_lo` <dbl>,
## #   `2064_159_lo` <dbl>, `2065_159_hi` <dbl>, `2479_231_lo` <dbl>,
## #   `2480_231_hi` <dbl>, `2499_231_lo` <dbl>, `2500_231_hi` <dbl>,
## #   `2513_231_hi` <dbl>, `2514_231_lo` <dbl>

head(data2)

## # A tibble: 6 x 3
##   Gene    Sample Value
##   <chr>   <chr>  <dbl>
## 1 C1orf43 HMLE_1 2639.
## 2 CHMP2A  HMLE_1  396.
## 3 EMC7    HMLE_1  627.
## 4 GPI     HMLE_1 5739.
## 5 PSMB2   HMLE_1 2770.
## 6 PSMB4   HMLE_1 2844.

#select values greater than 10000
data2_genesHigh <- filter(data2, Value > 10000)
data2_genesHigh %>% select(Gene) %>% distinct()

## # A tibble: 5 x 1
##   Gene   
##   <chr>  
## 1 C1orf43
## 2 GPI    
## 3 PSMB2  
## 4 PSMB4  
## 5 VCP

#Spread (opposite of gather)
data2_genesHighSpread <- spread(data2_genesHigh,Sample,Value)
data2_genesHighSpread

## # A tibble: 5 x 19
##   Gene    `2479_231_lo` `2480_231_hi` `2499_231_lo` `2500_231_hi`
##   <chr>           <dbl>         <dbl>         <dbl>         <dbl>
## 1 C1orf43        12555.           NA         10356.           NA 
## 2 GPI            20824.        19720.        11163.        17936.
## 3 PSMB2             NA            NA            NA            NA 
## 4 PSMB4          20726.        18809.        12315.        18363.
## 5 VCP            25282.        23515.        18529.        23857.
## # ... with 14 more variables: `2513_231_hi` <dbl>, `2514_231_lo` <dbl>,
## #   N8_hi_1 <dbl>, N8_hi_2 <dbl>, N8_hi_3 <dbl>, N8_hi_4 <dbl>,
## #   N8_hi_5 <dbl>, N8_hi_6 <dbl>, N8_lo_1 <dbl>, N8_lo_2 <dbl>,
## #   N8_lo_3 <dbl>, N8_lo_4 <dbl>, N8_lo_5 <dbl>, N8_lo_6 <dbl>

#group_by and summarise
data2_genesMean <- data2 %>% group_by(Gene) %>% summarise(average=mean(Value))
data2_genesMean

## # A tibble: 11 x 2
##    Gene    average
##    <chr>     <dbl>
##  1 C1orf43   8466.
##  2 CHMP2A    2408.
##  3 EMC7      2427.
##  4 GPI      13500.
##  5 PSMB2     6930.
##  6 PSMB4    14747.
##  7 RAB7A     4290.
##  8 REEP5     3781.
##  9 SNRPD3    1203.
## 10 VCP      15041.
## 11 VPS29     1867.

ggplot2 is a powerful popular graphics package that can be easily used with other tidyverse packages

library(ggplot2)

# boxplot on gene expression
data2 %>% ggplot(aes(x=factor(Gene), y=Value)) +geom_boxplot() +theme(axis.text.x = element_text(angle = 90))

# compare gene expression between two samples
data2 %>% filter(Sample %in% c("HMLE_1","HMLE_2")) %>% ggplot(aes(x=factor(Gene), y=Value, color=factor(Sample))) + geom_point() + theme(axis.text.x = element_text(angle = 90))