Introduction to R

Hands-on

This document accompanies the slides from BaRC’s Introduction to R and shows the use of some simple commands. See the accompanying slides and data files (and material from other Hot Topics talks) at http://barc.wi.mit.edu/hot_topics/ (when you click on “Statistics”). All of the commands below should work similarly if run on Windows/Mac/Linux (using the typical R installation or RStudio).

Our example data is the number of tumors in wild-type and knockout mice, each assayed in triplicate. For variable names, we use one-word (no spaces, anyway) names that can include any combination of lowercase and uppercase letters and numbers (although they must begin with a letter). It can be helpful to include dots to make the variables easier to read, like MGF.ko.trial.1 R will convert many other special characters (even dashes and underscores) into dots.

After starting your R session, save this file in your working directory.

##Part I: Getting familiar with R

###1 Entering data by hand, The c() (combine) command is used to concatenate several pieces of data into a vector. These can be numbers or text (but the latter must be surrounded by quotes).

# These are the tumor counts for the WT animals.
wt = c(5, 6, 7)
# These are the tumor counts for the KO animals.
ko = c(8, 9, 11)
wt

## [1] 5 6 7

ko

## [1]  8  9 11

Note that typing the variable by itself prints the values. The [1] at the beginning of each output row shows that the first value in the row is the first entry in the vector (list). This becomes useful if we have a vector containing more values that can be printed on one line.

We’ve also included some comments. Whenever R sees a pound sign #, it ignores the rest of the line. A liberal use of comments is very helpful to remind you or show others what each command does and perhaps why you chose the specific options you did.

###2 Running a statistical test

t.test(wt, ko)

## 
##  Welch Two Sample t-test
## 
## data:  wt and ko
## t = -3.1623, df = 3.4483, p-value = 0.04191
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.4542173 -0.2124494
## sample estimates:
## mean of x mean of y 
##  6.000000  9.333333

Unlike Excel, the output for a statistical test provides more information than just the p-value. But what if we don’t want to use the default t-test options. How can we find out how to change the default settings? Try entering the command, preceded by a question mark, as in ?t.test

Running this command on a Windows or Mac computer will usually open a documentation page in your web browser, which can be easier to read than the text-only Unix way of printing documentation.

We can also save the results of a statistical analysis as another variable which will then contain all the information that you saw printed to the screen (and sometimes a lot more). For example, let’s run a standard (i.e. Student’s) t-test (note the var.equal=T, which calculates equal variances for each variable).

wt.vs.ko = t.test(wt, ko, var.equal=T)
wt.vs.ko

## 
##  Two Sample t-test
## 
## data:  wt and ko
## t = -3.1623, df = 4, p-value = 0.03411
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.2599634 -0.4067032
## sample estimates:
## mean of x mean of y 
##  6.000000  9.333333

If we want to print or save just part of this output, we need to know how to access its different parts. Try the names() command to see what each part is called.

names(wt.vs.ko)

##  [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
##  [6] "null.value"  "stderr"      "alternative" "method"      "data.name"

Now we can access each part of this variable using the syntax VARIABLE$PART, such as

wt.vs.ko$p.value

## [1] 0.03410942

wt.vs.ko$conf.int

## [1] -6.2599634 -0.4067032
## attr(,"conf.level")
## [1] 0.95

What if we want to get the commands we used, to put them into a script to document our analyis pipeline? Use the “history” command, as in history(max.show=Inf).

###3 Reading data from a file To get started you probably want to get R and your data in the same place. To get R’s working directory, try

getwd()

## [1] "/nfs/BaRC_training/Intro_to_R"

To change the working directory on your computer, go to File => Change dir… menu. On tak (or you can do it this way on your computer, too), use the setwd() command, like setwd(“/lab/solexa_public/Graceland_Lab/Elvis”) We can see what files we have in our directory. (You won’t have all of these.)

dir()

## [1] "Intro_R.html"      "Intro_R.Rmd"       "tumor_boxplot.pdf"
## [4] "tumor_boxplot.png" "tumor_pvals.txt"

Let’s assume that we have a tab-delimited text file of our dataset (“tumors wt ko.txt”), with one column for wt and another for ko, with column labels in the first row. One way to read the file is with the read.delim() command, Note that we’ve included header=T to indicate that the first row has our column names. If we use header=F then the columns will just be numbered (and the data will include the first row of the file).

tumors = read.delim("http://barc.wi.mit.edu/education/hot_topics/Intro_to_R_2015/tumors_wt_ko.txt", header=T)
tumors

##   wt ko
## 1  5  8
## 2  6  9
## 3  7 11

Our data looks OK: we have one (labeled) column for each group of mice.

###4 Accessing a data matrix How do we access subsets of tumors? We can use column names or row and column numbers.

tumors$wt

## [1] 5 6 7

tumors$ko

## [1]  8  9 11

tumors[1:3,1]

## [1] 5 6 7

tumors[1:2,1:2]

##   wt ko
## 1  5  8
## 2  6  9

tumors[,1]

## [1] 5 6 7

t.test(tumors$wt, tumors$ko)

## 
##  Welch Two Sample t-test
## 
## data:  tumors$wt and tumors$ko
## t = -3.1623, df = 3.4483, p-value = 0.04191
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.4542173 -0.2124494
## sample estimates:
## mean of x mean of y 
##  6.000000  9.333333

###5 Creating an output table Unless we have just a little data, we may want to run several analyses and create a le to hold the output. One way to do that is to create a table the hold the output data. In this case, let’s say we want to try two variations of two statistical tests. We should note that normally we’d decide on a statistical test when designing an experiment, rather than trying a bunch of them after we already have the data. Let’s start by creating an empty matrix of two rows and two columns, labeled to show what we want each cell to hold.

pvals.out = matrix(data=NA, ncol=2, nrow=2)
colnames(pvals.out) = c("two.tail", "one.tail")
rownames(pvals.out) = c("Welch", "Wilcoxon")
pvals.out

##          two.tail one.tail
## Welch          NA       NA
## Wilcoxon       NA       NA

Now let’s run some statistical tests and put the output p-values into our table. We’ll start with Welch’s test for the first row, once as a two-sided test and once as a one-sided test.

pvals.out[1,1] = t.test(tumors$wt, tumors$ko)$p.value
pvals.out[1,2] = t.test(tumors$wt, tumors$ko, alt="less")$p.value

Now let’s try the Wilcoxon rank sum test, a non-parametric alternative to the t-test.

pvals.out[2,1] = wilcox.test(tumors$wt, tumors$ko)$p.value
pvals.out[2,2] = wilcox.test(tumors$wt, tumors$ko, alt="less")$p.value
pvals.out

##            two.tail   one.tail
## Welch    0.04191452 0.02095726
## Wilcoxon 0.10000000 0.05000000

Since we have row names, we can also use them to access subsets of our matrix

pvals.out["Welch",]

##   two.tail   one.tail 
## 0.04191452 0.02095726

###6 Printing an output table Our table looks fine, except for the fact that we have so many digits that some are meaningless. We could either round them now or later. If we choose to do it now, set the number of significant figures:

pvals.out.rounded = signif(pvals.out, 3)
pvals.out.rounded

##          two.tail one.tail
## Welch      0.0419    0.021
## Wilcoxon   0.1000    0.050

Now let’s print the table.

write.table(pvals.out.rounded, file="tumor_pvals.txt", sep="\t", quote=F)

We’ve included our output filename, that we want the file to be tab-delimited (sep=“”), and that we don’t want quotes around any text (quote=F). Now we can open “Tumor_pvals.txt” in any text editor or Excel. If we do so in Excel, however, our column labels will need to be shifted over one column. To avoid this, we could have created a 3-column table, with a first column containing “Welch” and “Wilcoxon” { and then print the table with the additional option row.names=F.

###7 Creating figures R is very powerful and very exible with its figure generation. Besides statistics, graphics are another great reason to learn R. We can start by creating a simple boxplot of our data.

boxplot(tumors)

Normally if we only had three points of data, a scatterplot would be more informative than a boxplot. Nevertheless, let’s add some more details the make the figure more readable.

boxplot(tumors, col=c("gray", "red"), main="MFG appears to be a tumor suppressor", ylab="number of tumors")

If we run R on our computer, a figure will pop up and can then be saved by right-clicking on it. On tak, a figure will end up in a file called “Rplots.pdf”.

Normally it’s more useful to include filenames in our code, as this makes it more reproducible and more automated. To do this, we need to first issue a command telling R what graphical format to use. Then we write our plotting command(s). A PDF file can hold multiple figures, by default one per page. Finally, we need to “close” the file when we’re finished with it.

#Create a PDF file 11 inches wide and 8.5 inches high
pdf("tumor_boxplot.pdf", w=11, h=8.5)
boxplot(tumors)
dev.off()

## png 
##   2

Another common format is PNG. The syntax is similar, except the figure dimensions are in pixels rather than inches.

png("tumor_boxplot.png", w=1800, h=1200)
boxplot(tumors)
dev.off()

## png 
##   2

###8 Accessing very low p-values R has an unexpected habit of rounding very low p-values to “< 2.2e-16”, but we can do better than that approximation (if desired). Let’s start by creating some very extreme values.

a = 1:10
a

##  [1]  1  2  3  4  5  6  7  8  9 10

b = a + 1000
b

##  [1] 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010

Now let’s run Welch’s two sample t-test. Then let’s run the same test but explicitly ask for the p-value.

t.test(a,b)

## 
##  Welch Two Sample t-test
## 
## data:  a and b
## t = -738.55, df = 18, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1002.8447  -997.1553
## sample estimates:
## mean of x mean of y 
##       5.5    1005.5

t.test(a,b)$p.value

## [1] 8.605537e-42

Note that we get a much more precise p-value using the second commmand. We may want to round the value with a modified command like,

signif(t.test(a,b)$p.value, 1)

## [1] 9e-42

###9 Ending a session Let’s save all of our commands. We may want to edit out the not-so-good ones later, but at least we have them all.

#savehistory(file="R_intro_Hot_Topics_commands.R")
#Note: savehistory will not work when knit'd

What variables do we have so far?

ls()

## [1] "a"                 "b"                 "ko"               
## [4] "pvals.out"         "pvals.out.rounded" "tumors"           
## [7] "wt"                "wt.vs.ko"

And what files have we created?

dir(pattern="tumor*")

## [1] "tumor_boxplot.pdf" "tumor_boxplot.png" "tumor_pvals.txt"

In the interests of totally reproducible research, we’ll end by printing our R environment.

date()

## [1] "Thu Nov 14 11:19:21 2024"

print(sessionInfo(), locale=FALSE)

## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.6 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.33     R6_2.5.1          lifecycle_1.0.4   jsonlite_1.8.8   
##  [5] evaluate_0.23     highr_0.10        cachem_1.0.8      rlang_1.1.2      
##  [9] cli_3.6.1         rstudioapi_0.15.0 jquerylib_0.1.4   bslib_0.6.1      
## [13] rmarkdown_2.25    tools_4.2.1       xfun_0.41         yaml_2.3.7       
## [17] fastmap_1.1.1     compiler_4.2.1    htmltools_0.5.7   knitr_1.45       
## [21] sass_0.4.8

This is just a start! There’s lots more….

To exit R, type q()

If you save the workspace image, you can start R again in about the same state as you are now. Then when you re-start R in the same directory, files called .RData and .Rhistory will reload of your data and command history.

###Part II: tidyverse

Tidyverse is suite of packages, written primarily by Hadley Wickham, for data analysis. We’ll use three packages: readr, tidyr and dplyr, which can make accessing/analyzing data easier than base R.

We’ll first load the packages,

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readr)
library(tidyr)

Read in the data file,

data<-read_tsv("http://barc.wi.mit.edu/education/hot_topics/Intro_to_R_2016/normalizedCounts_subset.txt")

## Rows: 11 Columns: 31
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr  (1): Gene
## dbl (30): HMLE_1, HMLE_2, HMLE_3, N8_1, N8_2, N8_3, N8_lo_1, N8_lo_2, N8_lo_...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(data)

## # A tibble: 6 × 31
##   Gene    HMLE_1 HMLE_2 HMLE_3  N8_1  N8_2  N8_3 N8_lo_1 N8_lo_2 N8_lo_3 N8_lo_4
##   <chr>    <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 C1orf43  2639.  2637   2140. 1974. 2325. 2721.  12383.  11561.  12654.  12025.
## 2 CHMP2A    396.   375.   308.  386.  390.  401.   3746.   3334.   3474.   3485.
## 3 EMC7      627.   659.   498.  670.  939.  624.   3926.   3887.   4368.   3674.
## 4 GPI      5739.  5868.  7034. 5778. 4972. 5461.  24078.  18815.  21059.  23859.
## 5 PSMB2    2770.  3052.  2986. 2989. 2574. 2113.  10581.  10733.  11364.  10265.
## 6 PSMB4    2844.  3279.  2797. 3216. 3303. 2677.  26545.  23818.  27669.  26145.
## # ℹ 20 more variables: N8_lo_5 <dbl>, N8_lo_6 <dbl>, N8_hi_1 <dbl>,
## #   N8_hi_2 <dbl>, N8_hi_3 <dbl>, N8_hi_4 <dbl>, N8_hi_5 <dbl>, N8_hi_6 <dbl>,
## #   `2020_159_hi` <dbl>, `2021_159_lo` <dbl>, `2051_159_hi` <dbl>,
## #   `2052_159_lo` <dbl>, `2064_159_lo` <dbl>, `2065_159_hi` <dbl>,
## #   `2479_231_lo` <dbl>, `2480_231_hi` <dbl>, `2499_231_lo` <dbl>,
## #   `2500_231_hi` <dbl>, `2513_231_hi` <dbl>, `2514_231_lo` <dbl>

We’ll now access/query certain columns using dplyr and show its functionality,

#columns 1 to 4
data_HMLE<- select(data, 1:4)
data_HMLE <- data %>% select(1:4)
head(data_HMLE)

## # A tibble: 6 × 4
##   Gene    HMLE_1 HMLE_2 HMLE_3
##   <chr>    <dbl>  <dbl>  <dbl>
## 1 C1orf43  2639.  2637   2140.
## 2 CHMP2A    396.   375.   308.
## 3 EMC7      627.   659.   498.
## 4 GPI      5739.  5868.  7034.
## 5 PSMB2    2770.  3052.  2986.
## 6 PSMB4    2844.  3279.  2797.

#only column "Gene"
data_rowNames <- select(data,Gene)
head(data_rowNames)

## # A tibble: 6 × 1
##   Gene   
##   <chr>  
## 1 C1orf43
## 2 CHMP2A 
## 3 EMC7   
## 4 GPI    
## 5 PSMB2  
## 6 PSMB4

#rename Column Gene to Gens
data_rowNames <- select(data,Genes=Gene)
head(data_rowNames)

## # A tibble: 6 × 1
##   Genes  
##   <chr>  
## 1 C1orf43
## 2 CHMP2A 
## 3 EMC7   
## 4 GPI    
## 5 PSMB2  
## 6 PSMB4

#Columns Gene to N8_3 (by column name)
data_range <- select(data, Gene:N8_3)
head(data_range)

## # A tibble: 6 × 7
##   Gene    HMLE_1 HMLE_2 HMLE_3  N8_1  N8_2  N8_3
##   <chr>    <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 C1orf43  2639.  2637   2140. 1974. 2325. 2721.
## 2 CHMP2A    396.   375.   308.  386.  390.  401.
## 3 EMC7      627.   659.   498.  670.  939.  624.
## 4 GPI      5739.  5868.  7034. 5778. 4972. 5461.
## 5 PSMB2    2770.  3052.  2986. 2989. 2574. 2113.
## 6 PSMB4    2844.  3279.  2797. 3216. 3303. 2677.

#All columns except Gene
data_noGene <- select(data,-Gene)
head(data_noGene)

## # A tibble: 6 × 30
##   HMLE_1 HMLE_2 HMLE_3  N8_1  N8_2  N8_3 N8_lo_1 N8_lo_2 N8_lo_3 N8_lo_4 N8_lo_5
##    <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1  2639.  2637   2140. 1974. 2325. 2721.  12383.  11561.  12654.  12025.  11570.
## 2   396.   375.   308.  386.  390.  401.   3746.   3334.   3474.   3485.   3196.
## 3   627.   659.   498.  670.  939.  624.   3926.   3887.   4368.   3674.   3862.
## 4  5739.  5868.  7034. 5778. 4972. 5461.  24078.  18815.  21059.  23859.  18633.
## 5  2770.  3052.  2986. 2989. 2574. 2113.  10581.  10733.  11364.  10265.  10857.
## 6  2844.  3279.  2797. 3216. 3303. 2677.  26545.  23818.  27669.  26145.  23917.
## # ℹ 19 more variables: N8_lo_6 <dbl>, N8_hi_1 <dbl>, N8_hi_2 <dbl>,
## #   N8_hi_3 <dbl>, N8_hi_4 <dbl>, N8_hi_5 <dbl>, N8_hi_6 <dbl>,
## #   `2020_159_hi` <dbl>, `2021_159_lo` <dbl>, `2051_159_hi` <dbl>,
## #   `2052_159_lo` <dbl>, `2064_159_lo` <dbl>, `2065_159_hi` <dbl>,
## #   `2479_231_lo` <dbl>, `2480_231_hi` <dbl>, `2499_231_lo` <dbl>,
## #   `2500_231_hi` <dbl>, `2513_231_hi` <dbl>, `2514_231_lo` <dbl>

#select columns with "lo" and calculate the mean
#?contains
data_lo <- select(data, contains("lo"))
head(data_lo)

## # A tibble: 6 × 12
##   N8_lo_1 N8_lo_2 N8_lo_3 N8_lo_4 N8_lo_5 N8_lo_6 `2021_159_lo` `2052_159_lo`
##     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>         <dbl>         <dbl>
## 1  12383.  11561.  12654.  12025.  11570.  12584.         4041.         6757.
## 2   3746.   3334.   3474.   3485.   3196.   3505.         1114.         1672.
## 3   3926.   3887.   4368.   3674.   3862.   4361.         1269.         1512.
## 4  24078.  18815.  21059.  23859.  18633.  21224.         7659.         5312.
## 5  10581.  10733.  11364.  10265.  10857.  11163.         3630.         2144.
## 6  26545.  23818.  27669.  26145.  23917.  27725.         3967.         3550 
## # ℹ 4 more variables: `2064_159_lo` <dbl>, `2479_231_lo` <dbl>,
## #   `2499_231_lo` <dbl>, `2514_231_lo` <dbl>

data_lo <- select(data, matches('lo|Gene'))
head(data_lo)

## # A tibble: 6 × 13
##   Gene    N8_lo_1 N8_lo_2 N8_lo_3 N8_lo_4 N8_lo_5 N8_lo_6 `2021_159_lo`
##   <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>         <dbl>
## 1 C1orf43  12383.  11561.  12654.  12025.  11570.  12584.         4041.
## 2 CHMP2A    3746.   3334.   3474.   3485.   3196.   3505.         1114.
## 3 EMC7      3926.   3887.   4368.   3674.   3862.   4361.         1269.
## 4 GPI      24078.  18815.  21059.  23859.  18633.  21224.         7659.
## 5 PSMB2    10581.  10733.  11364.  10265.  10857.  11163.         3630.
## 6 PSMB4    26545.  23818.  27669.  26145.  23917.  27725.         3967.
## # ℹ 5 more variables: `2052_159_lo` <dbl>, `2064_159_lo` <dbl>,
## #   `2479_231_lo` <dbl>, `2499_231_lo` <dbl>, `2514_231_lo` <dbl>

data_lo_mean <- select(data_lo,-Gene) %>% mutate(average=rowMeans(.))
data_lo_mean

## # A tibble: 11 × 13
##    N8_lo_1 N8_lo_2 N8_lo_3 N8_lo_4 N8_lo_5 N8_lo_6 `2021_159_lo` `2052_159_lo`
##      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>         <dbl>         <dbl>
##  1  12383.  11561.  12654.  12025.  11570.  12584.         4041.         6757.
##  2   3746.   3334.   3474.   3485.   3196.   3505.         1114.         1672.
##  3   3926.   3887.   4368.   3674.   3862.   4361.         1269.         1512.
##  4  24078.  18815.  21059.  23859.  18633.  21224.         7659.         5312.
##  5  10581.  10733.  11364.  10265.  10857.  11163.         3630.         2144.
##  6  26545.  23818.  27669.  26145.  23917.  27725.         3967.         3550 
##  7   4207.   4423.   4262.   4037.   4545.   4282.         5842.         4290.
##  8   5427.   5190.   5698.   5552.   5079.   5800.         2484.         2056.
##  9   1280.   1442.   1244.   1262.   1430.   1189.         1995.         1151.
## 10  20143.  19895.  21588.  20532.  19714.  21545.         9019.         3533.
## 11   1896.   2487.   2181.   1913.   2501.   2215.         1320.          620.
## # ℹ 5 more variables: `2064_159_lo` <dbl>, `2479_231_lo` <dbl>,
## #   `2499_231_lo` <dbl>, `2514_231_lo` <dbl>, average <dbl>

dplyr has function group_by and summarise which can be very helpful in grouping/summarising data

#Gather data to make it 'tidy'
data2<-data %>% gather(Sample, Value, HMLE_1:`2514_231_lo`)
knitr::kable(data)

Gene	HMLE_1	HMLE_2	HMLE_3	N8_1	N8_2	N8_3	N8_lo_1	N8_lo_2	N8_lo_3	N8_lo_4	N8_lo_5	N8_lo_6	N8_hi_1	N8_hi_2	N8_hi_3	N8_hi_4	N8_hi_5	N8_hi_6	2020_159_hi	2021_159_lo	2051_159_hi	2052_159_lo	2064_159_lo	2065_159_hi	2479_231_lo	2480_231_hi	2499_231_lo	2500_231_hi	2513_231_hi	2514_231_lo
C1orf43	2638.63	2637.00	2139.90	1974.41	2324.54	2720.88	12383.21	11561.25	12653.96	12025.11	11569.77	12584.14	10748.37	11580.84	11820.45	10911.87	11559.48	12014.27	4721.38	4040.94	6221.58	6756.85	6799.84	6313.16	12554.60	9495.91	10355.72	9697.18	7085.38	14096.89
CHMP2A	396.12	374.64	308.01	385.71	390.24	401.35	3745.64	3333.62	3474.17	3485.49	3195.98	3505.24	3250.63	3447.26	3273.25	3254.42	3453.26	3220.14	1152.03	1114.13	1539.93	1671.94	1653.83	1525.45	3257.41	4617.31	2688.81	4625.35	2265.31	3231.16
EMC7	627.20	659.25	497.97	670.10	939.40	624.18	3926.17	3886.75	4368.18	3673.52	3861.96	4360.74	4224.40	4401.20	4717.53	4091.59	4324.93	4737.03	1036.89	1268.55	1031.23	1512.44	1301.88	1011.79	1932.92	1911.13	1600.03	1997.40	1689.02	1917.48
GPI	5739.40	5868.39	7034.34	5778.32	4972.38	5461.45	24077.59	18815.02	21058.83	23858.70	18632.68	21223.80	20728.80	14331.97	19301.18	21043.71	14461.49	19351.92	7651.48	7659.03	6225.22	5312.36	5250.23	5892.63	20824.48	19719.69	11163.31	17936.13	9238.90	16378.83
PSMB2	2770.12	3052.46	2986.46	2989.33	2574.36	2112.81	10581.38	10733.41	11363.88	10264.95	10856.66	11163.09	9971.91	10309.92	10846.72	10176.13	10261.09	11339.53	2992.39	3630.22	1982.40	2143.70	2273.16	2088.62	8792.90	9162.77	6029.00	9473.94	6233.50	8740.18
PSMB4	2843.84	3278.91	2797.41	3216.03	3302.58	2677.03	26545.47	23817.93	27668.51	26144.64	23917.14	27724.72	22598.26	24148.99	26886.67	22290.90	23938.99	26663.68	3796.48	3967.37	3411.72	3550.00	3617.75	3177.27	20726.20	18808.92	12314.58	18363.02	10926.18	19297.21
RAB7A	3471.04	3693.25	4161.31	3209.67	3022.29	2965.63	4206.61	4422.55	4261.99	4037.39	4545.31	4281.79	3386.90	4530.12	3187.38	3590.15	4400.86	3339.22	7583.52	5842.31	6789.95	4289.65	3540.02	5098.12	3665.75	4083.54	4963.89	4437.36	4479.49	5209.84
REEP5	2844.94	2601.21	2491.71	2723.11	3065.28	2949.52	5426.53	5189.75	5698.44	5552.06	5078.59	5800.45	4905.75	5547.76	5747.94	5083.06	5620.29	5901.87	1449.00	2483.75	1165.14	2055.53	1353.13	1071.66	3824.88	3912.77	2905.62	3903.41	2800.87	4275.58
SNRPD3	1580.10	1821.74	1493.00	1463.32	1213.16	642.52	1279.51	1441.60	1244.16	1262.23	1429.94	1188.51	1006.42	1501.44	1169.51	1062.90	1479.10	1104.44	1537.09	1994.60	1179.69	1150.76	800.43	914.23	654.06	684.94	782.03	731.07	1493.16	787.39
VCP	8162.37	7444.10	8229.02	7667.78	8802.89	7981.43	20142.66	19895.29	21588.35	20531.65	19714.13	21545.33	22224.94	19112.33	19903.97	22435.19	19222.26	20376.98	6020.01	9018.95	3701.36	3532.59	4122.61	3993.95	25282.36	23514.88	18529.14	23856.52	14987.10	19691.58
VPS29	2404.81	3198.40	1956.39	1900.82	2673.96	1332.03	1896.48	2487.22	2181.22	1913.36	2501.47	2214.82	1797.07	2789.05	2287.50	1668.64	2709.16	2210.61	806.61	1320.30	566.92	620.03	392.96	381.36	2130.65	1242.98	2666.09	1359.02	1741.01	2658.64

knitr::kable(data2)

Gene	Sample	Value
C1orf43	HMLE_1	2638.63
CHMP2A	HMLE_1	396.12
EMC7	HMLE_1	627.20
GPI	HMLE_1	5739.40
PSMB2	HMLE_1	2770.12
PSMB4	HMLE_1	2843.84
RAB7A	HMLE_1	3471.04
REEP5	HMLE_1	2844.94
SNRPD3	HMLE_1	1580.10
VCP	HMLE_1	8162.37
VPS29	HMLE_1	2404.81
C1orf43	HMLE_2	2637.00
CHMP2A	HMLE_2	374.64
EMC7	HMLE_2	659.25
GPI	HMLE_2	5868.39
PSMB2	HMLE_2	3052.46
PSMB4	HMLE_2	3278.91
RAB7A	HMLE_2	3693.25
REEP5	HMLE_2	2601.21
SNRPD3	HMLE_2	1821.74
VCP	HMLE_2	7444.10
VPS29	HMLE_2	3198.40
C1orf43	HMLE_3	2139.90
CHMP2A	HMLE_3	308.01
EMC7	HMLE_3	497.97
GPI	HMLE_3	7034.34
PSMB2	HMLE_3	2986.46
PSMB4	HMLE_3	2797.41
RAB7A	HMLE_3	4161.31
REEP5	HMLE_3	2491.71
SNRPD3	HMLE_3	1493.00
VCP	HMLE_3	8229.02
VPS29	HMLE_3	1956.39
C1orf43	N8_1	1974.41
CHMP2A	N8_1	385.71
EMC7	N8_1	670.10
GPI	N8_1	5778.32
PSMB2	N8_1	2989.33
PSMB4	N8_1	3216.03
RAB7A	N8_1	3209.67
REEP5	N8_1	2723.11
SNRPD3	N8_1	1463.32
VCP	N8_1	7667.78
VPS29	N8_1	1900.82
C1orf43	N8_2	2324.54
CHMP2A	N8_2	390.24
EMC7	N8_2	939.40
GPI	N8_2	4972.38
PSMB2	N8_2	2574.36
PSMB4	N8_2	3302.58
RAB7A	N8_2	3022.29
REEP5	N8_2	3065.28
SNRPD3	N8_2	1213.16
VCP	N8_2	8802.89
VPS29	N8_2	2673.96
C1orf43	N8_3	2720.88
CHMP2A	N8_3	401.35
EMC7	N8_3	624.18
GPI	N8_3	5461.45
PSMB2	N8_3	2112.81
PSMB4	N8_3	2677.03
RAB7A	N8_3	2965.63
REEP5	N8_3	2949.52
SNRPD3	N8_3	642.52
VCP	N8_3	7981.43
VPS29	N8_3	1332.03
C1orf43	N8_lo_1	12383.21
CHMP2A	N8_lo_1	3745.64
EMC7	N8_lo_1	3926.17
GPI	N8_lo_1	24077.59
PSMB2	N8_lo_1	10581.38
PSMB4	N8_lo_1	26545.47
RAB7A	N8_lo_1	4206.61
REEP5	N8_lo_1	5426.53
SNRPD3	N8_lo_1	1279.51
VCP	N8_lo_1	20142.66
VPS29	N8_lo_1	1896.48
C1orf43	N8_lo_2	11561.25
CHMP2A	N8_lo_2	3333.62
EMC7	N8_lo_2	3886.75
GPI	N8_lo_2	18815.02
PSMB2	N8_lo_2	10733.41
PSMB4	N8_lo_2	23817.93
RAB7A	N8_lo_2	4422.55
REEP5	N8_lo_2	5189.75
SNRPD3	N8_lo_2	1441.60
VCP	N8_lo_2	19895.29
VPS29	N8_lo_2	2487.22
C1orf43	N8_lo_3	12653.96
CHMP2A	N8_lo_3	3474.17
EMC7	N8_lo_3	4368.18
GPI	N8_lo_3	21058.83
PSMB2	N8_lo_3	11363.88
PSMB4	N8_lo_3	27668.51
RAB7A	N8_lo_3	4261.99
REEP5	N8_lo_3	5698.44
SNRPD3	N8_lo_3	1244.16
VCP	N8_lo_3	21588.35
VPS29	N8_lo_3	2181.22
C1orf43	N8_lo_4	12025.11
CHMP2A	N8_lo_4	3485.49
EMC7	N8_lo_4	3673.52
GPI	N8_lo_4	23858.70
PSMB2	N8_lo_4	10264.95
PSMB4	N8_lo_4	26144.64
RAB7A	N8_lo_4	4037.39
REEP5	N8_lo_4	5552.06
SNRPD3	N8_lo_4	1262.23
VCP	N8_lo_4	20531.65
VPS29	N8_lo_4	1913.36
C1orf43	N8_lo_5	11569.77
CHMP2A	N8_lo_5	3195.98
EMC7	N8_lo_5	3861.96
GPI	N8_lo_5	18632.68
PSMB2	N8_lo_5	10856.66
PSMB4	N8_lo_5	23917.14
RAB7A	N8_lo_5	4545.31
REEP5	N8_lo_5	5078.59
SNRPD3	N8_lo_5	1429.94
VCP	N8_lo_5	19714.13
VPS29	N8_lo_5	2501.47
C1orf43	N8_lo_6	12584.14
CHMP2A	N8_lo_6	3505.24
EMC7	N8_lo_6	4360.74
GPI	N8_lo_6	21223.80
PSMB2	N8_lo_6	11163.09
PSMB4	N8_lo_6	27724.72
RAB7A	N8_lo_6	4281.79
REEP5	N8_lo_6	5800.45
SNRPD3	N8_lo_6	1188.51
VCP	N8_lo_6	21545.33
VPS29	N8_lo_6	2214.82
C1orf43	N8_hi_1	10748.37
CHMP2A	N8_hi_1	3250.63
EMC7	N8_hi_1	4224.40
GPI	N8_hi_1	20728.80
PSMB2	N8_hi_1	9971.91
PSMB4	N8_hi_1	22598.26
RAB7A	N8_hi_1	3386.90
REEP5	N8_hi_1	4905.75
SNRPD3	N8_hi_1	1006.42
VCP	N8_hi_1	22224.94
VPS29	N8_hi_1	1797.07
C1orf43	N8_hi_2	11580.84
CHMP2A	N8_hi_2	3447.26
EMC7	N8_hi_2	4401.20
GPI	N8_hi_2	14331.97
PSMB2	N8_hi_2	10309.92
PSMB4	N8_hi_2	24148.99
RAB7A	N8_hi_2	4530.12
REEP5	N8_hi_2	5547.76
SNRPD3	N8_hi_2	1501.44
VCP	N8_hi_2	19112.33
VPS29	N8_hi_2	2789.05
C1orf43	N8_hi_3	11820.45
CHMP2A	N8_hi_3	3273.25
EMC7	N8_hi_3	4717.53
GPI	N8_hi_3	19301.18
PSMB2	N8_hi_3	10846.72
PSMB4	N8_hi_3	26886.67
RAB7A	N8_hi_3	3187.38
REEP5	N8_hi_3	5747.94
SNRPD3	N8_hi_3	1169.51
VCP	N8_hi_3	19903.97
VPS29	N8_hi_3	2287.50
C1orf43	N8_hi_4	10911.87
CHMP2A	N8_hi_4	3254.42
EMC7	N8_hi_4	4091.59
GPI	N8_hi_4	21043.71
PSMB2	N8_hi_4	10176.13
PSMB4	N8_hi_4	22290.90
RAB7A	N8_hi_4	3590.15
REEP5	N8_hi_4	5083.06
SNRPD3	N8_hi_4	1062.90
VCP	N8_hi_4	22435.19
VPS29	N8_hi_4	1668.64
C1orf43	N8_hi_5	11559.48
CHMP2A	N8_hi_5	3453.26
EMC7	N8_hi_5	4324.93
GPI	N8_hi_5	14461.49
PSMB2	N8_hi_5	10261.09
PSMB4	N8_hi_5	23938.99
RAB7A	N8_hi_5	4400.86
REEP5	N8_hi_5	5620.29
SNRPD3	N8_hi_5	1479.10
VCP	N8_hi_5	19222.26
VPS29	N8_hi_5	2709.16
C1orf43	N8_hi_6	12014.27
CHMP2A	N8_hi_6	3220.14
EMC7	N8_hi_6	4737.03
GPI	N8_hi_6	19351.92
PSMB2	N8_hi_6	11339.53
PSMB4	N8_hi_6	26663.68
RAB7A	N8_hi_6	3339.22
REEP5	N8_hi_6	5901.87
SNRPD3	N8_hi_6	1104.44
VCP	N8_hi_6	20376.98
VPS29	N8_hi_6	2210.61
C1orf43	2020_159_hi	4721.38
CHMP2A	2020_159_hi	1152.03
EMC7	2020_159_hi	1036.89
GPI	2020_159_hi	7651.48
PSMB2	2020_159_hi	2992.39
PSMB4	2020_159_hi	3796.48
RAB7A	2020_159_hi	7583.52
REEP5	2020_159_hi	1449.00
SNRPD3	2020_159_hi	1537.09
VCP	2020_159_hi	6020.01
VPS29	2020_159_hi	806.61
C1orf43	2021_159_lo	4040.94
CHMP2A	2021_159_lo	1114.13
EMC7	2021_159_lo	1268.55
GPI	2021_159_lo	7659.03
PSMB2	2021_159_lo	3630.22
PSMB4	2021_159_lo	3967.37
RAB7A	2021_159_lo	5842.31
REEP5	2021_159_lo	2483.75
SNRPD3	2021_159_lo	1994.60
VCP	2021_159_lo	9018.95
VPS29	2021_159_lo	1320.30
C1orf43	2051_159_hi	6221.58
CHMP2A	2051_159_hi	1539.93
EMC7	2051_159_hi	1031.23
GPI	2051_159_hi	6225.22
PSMB2	2051_159_hi	1982.40
PSMB4	2051_159_hi	3411.72
RAB7A	2051_159_hi	6789.95
REEP5	2051_159_hi	1165.14
SNRPD3	2051_159_hi	1179.69
VCP	2051_159_hi	3701.36
VPS29	2051_159_hi	566.92
C1orf43	2052_159_lo	6756.85
CHMP2A	2052_159_lo	1671.94
EMC7	2052_159_lo	1512.44
GPI	2052_159_lo	5312.36
PSMB2	2052_159_lo	2143.70
PSMB4	2052_159_lo	3550.00
RAB7A	2052_159_lo	4289.65
REEP5	2052_159_lo	2055.53
SNRPD3	2052_159_lo	1150.76
VCP	2052_159_lo	3532.59
VPS29	2052_159_lo	620.03
C1orf43	2064_159_lo	6799.84
CHMP2A	2064_159_lo	1653.83
EMC7	2064_159_lo	1301.88
GPI	2064_159_lo	5250.23
PSMB2	2064_159_lo	2273.16
PSMB4	2064_159_lo	3617.75
RAB7A	2064_159_lo	3540.02
REEP5	2064_159_lo	1353.13
SNRPD3	2064_159_lo	800.43
VCP	2064_159_lo	4122.61
VPS29	2064_159_lo	392.96
C1orf43	2065_159_hi	6313.16
CHMP2A	2065_159_hi	1525.45
EMC7	2065_159_hi	1011.79
GPI	2065_159_hi	5892.63
PSMB2	2065_159_hi	2088.62
PSMB4	2065_159_hi	3177.27
RAB7A	2065_159_hi	5098.12
REEP5	2065_159_hi	1071.66
SNRPD3	2065_159_hi	914.23
VCP	2065_159_hi	3993.95
VPS29	2065_159_hi	381.36
C1orf43	2479_231_lo	12554.60
CHMP2A	2479_231_lo	3257.41
EMC7	2479_231_lo	1932.92
GPI	2479_231_lo	20824.48
PSMB2	2479_231_lo	8792.90
PSMB4	2479_231_lo	20726.20
RAB7A	2479_231_lo	3665.75
REEP5	2479_231_lo	3824.88
SNRPD3	2479_231_lo	654.06
VCP	2479_231_lo	25282.36
VPS29	2479_231_lo	2130.65
C1orf43	2480_231_hi	9495.91
CHMP2A	2480_231_hi	4617.31
EMC7	2480_231_hi	1911.13
GPI	2480_231_hi	19719.69
PSMB2	2480_231_hi	9162.77
PSMB4	2480_231_hi	18808.92
RAB7A	2480_231_hi	4083.54
REEP5	2480_231_hi	3912.77
SNRPD3	2480_231_hi	684.94
VCP	2480_231_hi	23514.88
VPS29	2480_231_hi	1242.98
C1orf43	2499_231_lo	10355.72
CHMP2A	2499_231_lo	2688.81
EMC7	2499_231_lo	1600.03
GPI	2499_231_lo	11163.31
PSMB2	2499_231_lo	6029.00
PSMB4	2499_231_lo	12314.58
RAB7A	2499_231_lo	4963.89
REEP5	2499_231_lo	2905.62
SNRPD3	2499_231_lo	782.03
VCP	2499_231_lo	18529.14
VPS29	2499_231_lo	2666.09
C1orf43	2500_231_hi	9697.18
CHMP2A	2500_231_hi	4625.35
EMC7	2500_231_hi	1997.40
GPI	2500_231_hi	17936.13
PSMB2	2500_231_hi	9473.94
PSMB4	2500_231_hi	18363.02
RAB7A	2500_231_hi	4437.36
REEP5	2500_231_hi	3903.41
SNRPD3	2500_231_hi	731.07
VCP	2500_231_hi	23856.52
VPS29	2500_231_hi	1359.02
C1orf43	2513_231_hi	7085.38
CHMP2A	2513_231_hi	2265.31
EMC7	2513_231_hi	1689.02
GPI	2513_231_hi	9238.90
PSMB2	2513_231_hi	6233.50
PSMB4	2513_231_hi	10926.18
RAB7A	2513_231_hi	4479.49
REEP5	2513_231_hi	2800.87
SNRPD3	2513_231_hi	1493.16
VCP	2513_231_hi	14987.10
VPS29	2513_231_hi	1741.01
C1orf43	2514_231_lo	14096.89
CHMP2A	2514_231_lo	3231.16
EMC7	2514_231_lo	1917.48
GPI	2514_231_lo	16378.83
PSMB2	2514_231_lo	8740.18
PSMB4	2514_231_lo	19297.21
RAB7A	2514_231_lo	5209.84
REEP5	2514_231_lo	4275.58
SNRPD3	2514_231_lo	787.39
VCP	2514_231_lo	19691.58
VPS29	2514_231_lo	2658.64

#select values greater than 10000
data2_genesHigh <- filter(data2, Value > 10000)
data2_genesHigh %>% select(Gene) %>% distinct()

## # A tibble: 5 × 1
##   Gene   
##   <chr>  
## 1 C1orf43
## 2 GPI    
## 3 PSMB2  
## 4 PSMB4  
## 5 VCP

#Spread (opposite of gather)
data2_genesHighSpread <- spread(data2_genesHigh,Sample,Value)
data2_genesHighSpread

## # A tibble: 5 × 19
##   Gene    `2479_231_lo` `2480_231_hi` `2499_231_lo` `2500_231_hi` `2513_231_hi`
##   <chr>           <dbl>         <dbl>         <dbl>         <dbl>         <dbl>
## 1 C1orf43        12555.           NA         10356.           NA            NA 
## 2 GPI            20824.        19720.        11163.        17936.           NA 
## 3 PSMB2             NA            NA            NA            NA            NA 
## 4 PSMB4          20726.        18809.        12315.        18363.        10926.
## 5 VCP            25282.        23515.        18529.        23857.        14987.
## # ℹ 13 more variables: `2514_231_lo` <dbl>, N8_hi_1 <dbl>, N8_hi_2 <dbl>,
## #   N8_hi_3 <dbl>, N8_hi_4 <dbl>, N8_hi_5 <dbl>, N8_hi_6 <dbl>, N8_lo_1 <dbl>,
## #   N8_lo_2 <dbl>, N8_lo_3 <dbl>, N8_lo_4 <dbl>, N8_lo_5 <dbl>, N8_lo_6 <dbl>

#group_by and summarise
data2_genesMean <- data2 %>% group_by(Gene) %>% summarise(average=mean(Value))
data2_genesMean

## # A tibble: 11 × 2
##    Gene    average
##    <chr>     <dbl>
##  1 C1orf43   8466.
##  2 CHMP2A    2408.
##  3 EMC7      2427.
##  4 GPI      13500.
##  5 PSMB2     6930.
##  6 PSMB4    14747.
##  7 RAB7A     4290.
##  8 REEP5     3781.
##  9 SNRPD3    1203.
## 10 VCP      15041.
## 11 VPS29     1867.

ggplot2 is a powerful popular graphics package that can be easily used with other tidyverse packages

library(ggplot2)

data2 %>% ggplot(aes(x=factor(Gene),y=Value))+geom_boxplot()+theme(axis.text.x = element_text(angle = 90, hjust = 1))

data2 %>% filter(Sample %in% c("HMLE_1","HMLE_2")) %>% ggplot(aes(x=factor(Gene),y=Value))+geom_point()+facet_grid(. ~ Sample)+theme(axis.text.x = element_text(angle = 90, hjust = 1)) + geom_line ()

## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?