Hands-on

This document accompanies the slides from BaRC’s Introduction to R and shows the use of some simple commands. See the accompanying slides and data files (and material from other Hot Topics talks) at http://barc.wi.mit.edu/hot_topics/ (when you click on “Statistics”). All of the commands below should work similarly if run on Windows/Mac/Linux (using the typical R installation or RStudio).

Our example data is the number of tumors in wild-type and knockout mice, each assayed in triplicate. For variable names, we use one-word (no spaces, anyway) names that can include any combination of lowercase and uppercase letters and numbers (although they must begin with a letter). It can be helpful to include dots to make the variables easier to read, like MGF.ko.trial.1 R will convert many other special characters (even dashes and underscores) into dots.

After starting your R session, save this file in your working directory.

##Part I: Getting familiar with R

###1 Entering data by hand, The c() (combine) command is used to concatenate several pieces of data into a vector. These can be numbers or text (but the latter must be surrounded by quotes).

# These are the tumor counts for the WT animals.
wt = c(5, 6, 7)
# These are the tumor counts for the KO animals.
ko = c(8, 9, 11)
wt
## [1] 5 6 7
ko
## [1]  8  9 11

Note that typing the variable by itself prints the values. The [1] at the beginning of each output row shows that the first value in the row is the first entry in the vector (list). This becomes useful if we have a vector containing more values that can be printed on one line.

We’ve also included some comments. Whenever R sees a pound sign #, it ignores the rest of the line. A liberal use of comments is very helpful to remind you or show others what each command does and perhaps why you chose the specific options you did.

###2 Running a statistical test

t.test(wt, ko)
## 
##  Welch Two Sample t-test
## 
## data:  wt and ko
## t = -3.1623, df = 3.4483, p-value = 0.04191
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.4542173 -0.2124494
## sample estimates:
## mean of x mean of y 
##  6.000000  9.333333

Unlike Excel, the output for a statistical test provides more information than just the p-value. But what if we don’t want to use the default t-test options. How can we find out how to change the default settings? Try entering the command, preceded by a question mark, as in ?t.test

Running this command on a Windows or Mac computer will usually open a documentation page in your web browser, which can be easier to read than the text-only Unix way of printing documentation.

We can also save the results of a statistical analysis as another variable which will then contain all the information that you saw printed to the screen (and sometimes a lot more). For example, let’s run a standard (i.e. Student’s) t-test (note the var.equal=T, which calculates equal variances for each variable).

wt.vs.ko = t.test(wt, ko, var.equal=T)
wt.vs.ko
## 
##  Two Sample t-test
## 
## data:  wt and ko
## t = -3.1623, df = 4, p-value = 0.03411
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.2599634 -0.4067032
## sample estimates:
## mean of x mean of y 
##  6.000000  9.333333

If we want to print or save just part of this output, we need to know how to access its different parts. Try the names() command to see what each part is called.

names(wt.vs.ko)
##  [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
##  [6] "null.value"  "stderr"      "alternative" "method"      "data.name"

Now we can access each part of this variable using the syntax VARIABLE$PART, such as

wt.vs.ko$p.value
## [1] 0.03410942
wt.vs.ko$conf.int
## [1] -6.2599634 -0.4067032
## attr(,"conf.level")
## [1] 0.95

What if we want to get the commands we used, to put them into a script to document our analyis pipeline? Use the “history” command, as in history(max.show=Inf).

###3 Reading data from a file To get started you probably want to get R and your data in the same place. To get R’s working directory, try

getwd()
## [1] "/nfs/BaRC_training/Intro_to_R"

To change the working directory on your computer, go to File => Change dir… menu. On tak (or you can do it this way on your computer, too), use the setwd() command, like setwd(“/lab/solexa_public/Graceland_Lab/Elvis”) We can see what files we have in our directory. (You won’t have all of these.)

dir()
## [1] "Intro_R.html"      "Intro_R.Rmd"       "tumor_boxplot.pdf"
## [4] "tumor_boxplot.png" "tumor_pvals.txt"

Let’s assume that we have a tab-delimited text file of our dataset (“tumors wt ko.txt”), with one column for wt and another for ko, with column labels in the first row. One way to read the file is with the read.delim() command, Note that we’ve included header=T to indicate that the first row has our column names. If we use header=F then the columns will just be numbered (and the data will include the first row of the file).

tumors = read.delim("http://barc.wi.mit.edu/education/hot_topics/Intro_to_R_2015/tumors_wt_ko.txt", header=T)
tumors
##   wt ko
## 1  5  8
## 2  6  9
## 3  7 11

Our data looks OK: we have one (labeled) column for each group of mice.

###4 Accessing a data matrix How do we access subsets of tumors? We can use column names or row and column numbers.

tumors$wt
## [1] 5 6 7
tumors$ko
## [1]  8  9 11
tumors[1:3,1]
## [1] 5 6 7
tumors[1:2,1:2]
##   wt ko
## 1  5  8
## 2  6  9
tumors[,1]
## [1] 5 6 7
t.test(tumors$wt, tumors$ko)
## 
##  Welch Two Sample t-test
## 
## data:  tumors$wt and tumors$ko
## t = -3.1623, df = 3.4483, p-value = 0.04191
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.4542173 -0.2124494
## sample estimates:
## mean of x mean of y 
##  6.000000  9.333333

###5 Creating an output table Unless we have just a little data, we may want to run several analyses and create a le to hold the output. One way to do that is to create a table the hold the output data. In this case, let’s say we want to try two variations of two statistical tests. We should note that normally we’d decide on a statistical test when designing an experiment, rather than trying a bunch of them after we already have the data. Let’s start by creating an empty matrix of two rows and two columns, labeled to show what we want each cell to hold.

pvals.out = matrix(data=NA, ncol=2, nrow=2)
colnames(pvals.out) = c("two.tail", "one.tail")
rownames(pvals.out) = c("Welch", "Wilcoxon")
pvals.out
##          two.tail one.tail
## Welch          NA       NA
## Wilcoxon       NA       NA

Now let’s run some statistical tests and put the output p-values into our table. We’ll start with Welch’s test for the first row, once as a two-sided test and once as a one-sided test.

pvals.out[1,1] = t.test(tumors$wt, tumors$ko)$p.value
pvals.out[1,2] = t.test(tumors$wt, tumors$ko, alt="less")$p.value

Now let’s try the Wilcoxon rank sum test, a non-parametric alternative to the t-test.

pvals.out[2,1] = wilcox.test(tumors$wt, tumors$ko)$p.value
pvals.out[2,2] = wilcox.test(tumors$wt, tumors$ko, alt="less")$p.value
pvals.out
##            two.tail   one.tail
## Welch    0.04191452 0.02095726
## Wilcoxon 0.10000000 0.05000000

Since we have row names, we can also use them to access subsets of our matrix

pvals.out["Welch",]
##   two.tail   one.tail 
## 0.04191452 0.02095726

###6 Printing an output table Our table looks fine, except for the fact that we have so many digits that some are meaningless. We could either round them now or later. If we choose to do it now, set the number of significant figures:

pvals.out.rounded = signif(pvals.out, 3)
pvals.out.rounded
##          two.tail one.tail
## Welch      0.0419    0.021
## Wilcoxon   0.1000    0.050

Now let’s print the table.

write.table(pvals.out.rounded, file="tumor_pvals.txt", sep="\t", quote=F)

We’ve included our output filename, that we want the file to be tab-delimited (sep=“”), and that we don’t want quotes around any text (quote=F). Now we can open “Tumor_pvals.txt” in any text editor or Excel. If we do so in Excel, however, our column labels will need to be shifted over one column. To avoid this, we could have created a 3-column table, with a first column containing “Welch” and “Wilcoxon” { and then print the table with the additional option row.names=F.

###7 Creating figures R is very powerful and very exible with its figure generation. Besides statistics, graphics are another great reason to learn R. We can start by creating a simple boxplot of our data.

boxplot(tumors)

Normally if we only had three points of data, a scatterplot would be more informative than a boxplot. Nevertheless, let’s add some more details the make the figure more readable.

boxplot(tumors, col=c("gray", "red"), main="MFG appears to be a tumor suppressor", ylab="number of tumors")

If we run R on our computer, a figure will pop up and can then be saved by right-clicking on it. On tak, a figure will end up in a file called “Rplots.pdf”.

Normally it’s more useful to include filenames in our code, as this makes it more reproducible and more automated. To do this, we need to first issue a command telling R what graphical format to use. Then we write our plotting command(s). A PDF file can hold multiple figures, by default one per page. Finally, we need to “close” the file when we’re finished with it.

#Create a PDF file 11 inches wide and 8.5 inches high
pdf("tumor_boxplot.pdf", w=11, h=8.5)
boxplot(tumors)
dev.off()
## png 
##   2

Another common format is PNG. The syntax is similar, except the figure dimensions are in pixels rather than inches.

png("tumor_boxplot.png", w=1800, h=1200)
boxplot(tumors)
dev.off()
## png 
##   2

###8 Accessing very low p-values R has an unexpected habit of rounding very low p-values to “< 2.2e-16”, but we can do better than that approximation (if desired). Let’s start by creating some very extreme values.

a = 1:10
a
##  [1]  1  2  3  4  5  6  7  8  9 10
b = a + 1000
b
##  [1] 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010

Now let’s run Welch’s two sample t-test. Then let’s run the same test but explicitly ask for the p-value.

t.test(a,b)
## 
##  Welch Two Sample t-test
## 
## data:  a and b
## t = -738.55, df = 18, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1002.8447  -997.1553
## sample estimates:
## mean of x mean of y 
##       5.5    1005.5
t.test(a,b)$p.value
## [1] 8.605537e-42

Note that we get a much more precise p-value using the second commmand. We may want to round the value with a modified command like,

signif(t.test(a,b)$p.value, 1)
## [1] 9e-42

###9 Ending a session Let’s save all of our commands. We may want to edit out the not-so-good ones later, but at least we have them all.

#savehistory(file="R_intro_Hot_Topics_commands.R")
#Note: savehistory will not work when knit'd

What variables do we have so far?

ls()
## [1] "a"                 "b"                 "ko"               
## [4] "pvals.out"         "pvals.out.rounded" "tumors"           
## [7] "wt"                "wt.vs.ko"

And what files have we created?

dir(pattern="tumor*")
## [1] "tumor_boxplot.pdf" "tumor_boxplot.png" "tumor_pvals.txt"

In the interests of totally reproducible research, we’ll end by printing our R environment.

date()
## [1] "Thu Nov 14 11:19:21 2024"
print(sessionInfo(), locale=FALSE)
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.6 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.33     R6_2.5.1          lifecycle_1.0.4   jsonlite_1.8.8   
##  [5] evaluate_0.23     highr_0.10        cachem_1.0.8      rlang_1.1.2      
##  [9] cli_3.6.1         rstudioapi_0.15.0 jquerylib_0.1.4   bslib_0.6.1      
## [13] rmarkdown_2.25    tools_4.2.1       xfun_0.41         yaml_2.3.7       
## [17] fastmap_1.1.1     compiler_4.2.1    htmltools_0.5.7   knitr_1.45       
## [21] sass_0.4.8

This is just a start! There’s lots more….

To exit R, type q()

If you save the workspace image, you can start R again in about the same state as you are now. Then when you re-start R in the same directory, files called .RData and .Rhistory will reload of your data and command history.

###Part II: tidyverse

Tidyverse is suite of packages, written primarily by Hadley Wickham, for data analysis. We’ll use three packages: readr, tidyr and dplyr, which can make accessing/analyzing data easier than base R.

We’ll first load the packages,

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
library(tidyr)

Read in the data file,

data<-read_tsv("http://barc.wi.mit.edu/education/hot_topics/Intro_to_R_2016/normalizedCounts_subset.txt")
## Rows: 11 Columns: 31
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr  (1): Gene
## dbl (30): HMLE_1, HMLE_2, HMLE_3, N8_1, N8_2, N8_3, N8_lo_1, N8_lo_2, N8_lo_...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data)
## # A tibble: 6 × 31
##   Gene    HMLE_1 HMLE_2 HMLE_3  N8_1  N8_2  N8_3 N8_lo_1 N8_lo_2 N8_lo_3 N8_lo_4
##   <chr>    <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 C1orf43  2639.  2637   2140. 1974. 2325. 2721.  12383.  11561.  12654.  12025.
## 2 CHMP2A    396.   375.   308.  386.  390.  401.   3746.   3334.   3474.   3485.
## 3 EMC7      627.   659.   498.  670.  939.  624.   3926.   3887.   4368.   3674.
## 4 GPI      5739.  5868.  7034. 5778. 4972. 5461.  24078.  18815.  21059.  23859.
## 5 PSMB2    2770.  3052.  2986. 2989. 2574. 2113.  10581.  10733.  11364.  10265.
## 6 PSMB4    2844.  3279.  2797. 3216. 3303. 2677.  26545.  23818.  27669.  26145.
## # ℹ 20 more variables: N8_lo_5 <dbl>, N8_lo_6 <dbl>, N8_hi_1 <dbl>,
## #   N8_hi_2 <dbl>, N8_hi_3 <dbl>, N8_hi_4 <dbl>, N8_hi_5 <dbl>, N8_hi_6 <dbl>,
## #   `2020_159_hi` <dbl>, `2021_159_lo` <dbl>, `2051_159_hi` <dbl>,
## #   `2052_159_lo` <dbl>, `2064_159_lo` <dbl>, `2065_159_hi` <dbl>,
## #   `2479_231_lo` <dbl>, `2480_231_hi` <dbl>, `2499_231_lo` <dbl>,
## #   `2500_231_hi` <dbl>, `2513_231_hi` <dbl>, `2514_231_lo` <dbl>

We’ll now access/query certain columns using dplyr and show its functionality,

#columns 1 to 4
data_HMLE<- select(data, 1:4)
data_HMLE <- data %>% select(1:4)
head(data_HMLE)
## # A tibble: 6 × 4
##   Gene    HMLE_1 HMLE_2 HMLE_3
##   <chr>    <dbl>  <dbl>  <dbl>
## 1 C1orf43  2639.  2637   2140.
## 2 CHMP2A    396.   375.   308.
## 3 EMC7      627.   659.   498.
## 4 GPI      5739.  5868.  7034.
## 5 PSMB2    2770.  3052.  2986.
## 6 PSMB4    2844.  3279.  2797.
#only column "Gene"
data_rowNames <- select(data,Gene)
head(data_rowNames)
## # A tibble: 6 × 1
##   Gene   
##   <chr>  
## 1 C1orf43
## 2 CHMP2A 
## 3 EMC7   
## 4 GPI    
## 5 PSMB2  
## 6 PSMB4
#rename Column Gene to Gens
data_rowNames <- select(data,Genes=Gene)
head(data_rowNames)
## # A tibble: 6 × 1
##   Genes  
##   <chr>  
## 1 C1orf43
## 2 CHMP2A 
## 3 EMC7   
## 4 GPI    
## 5 PSMB2  
## 6 PSMB4
#Columns Gene to N8_3 (by column name)
data_range <- select(data, Gene:N8_3)
head(data_range)
## # A tibble: 6 × 7
##   Gene    HMLE_1 HMLE_2 HMLE_3  N8_1  N8_2  N8_3
##   <chr>    <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 C1orf43  2639.  2637   2140. 1974. 2325. 2721.
## 2 CHMP2A    396.   375.   308.  386.  390.  401.
## 3 EMC7      627.   659.   498.  670.  939.  624.
## 4 GPI      5739.  5868.  7034. 5778. 4972. 5461.
## 5 PSMB2    2770.  3052.  2986. 2989. 2574. 2113.
## 6 PSMB4    2844.  3279.  2797. 3216. 3303. 2677.
#All columns except Gene
data_noGene <- select(data,-Gene)
head(data_noGene)
## # A tibble: 6 × 30
##   HMLE_1 HMLE_2 HMLE_3  N8_1  N8_2  N8_3 N8_lo_1 N8_lo_2 N8_lo_3 N8_lo_4 N8_lo_5
##    <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1  2639.  2637   2140. 1974. 2325. 2721.  12383.  11561.  12654.  12025.  11570.
## 2   396.   375.   308.  386.  390.  401.   3746.   3334.   3474.   3485.   3196.
## 3   627.   659.   498.  670.  939.  624.   3926.   3887.   4368.   3674.   3862.
## 4  5739.  5868.  7034. 5778. 4972. 5461.  24078.  18815.  21059.  23859.  18633.
## 5  2770.  3052.  2986. 2989. 2574. 2113.  10581.  10733.  11364.  10265.  10857.
## 6  2844.  3279.  2797. 3216. 3303. 2677.  26545.  23818.  27669.  26145.  23917.
## # ℹ 19 more variables: N8_lo_6 <dbl>, N8_hi_1 <dbl>, N8_hi_2 <dbl>,
## #   N8_hi_3 <dbl>, N8_hi_4 <dbl>, N8_hi_5 <dbl>, N8_hi_6 <dbl>,
## #   `2020_159_hi` <dbl>, `2021_159_lo` <dbl>, `2051_159_hi` <dbl>,
## #   `2052_159_lo` <dbl>, `2064_159_lo` <dbl>, `2065_159_hi` <dbl>,
## #   `2479_231_lo` <dbl>, `2480_231_hi` <dbl>, `2499_231_lo` <dbl>,
## #   `2500_231_hi` <dbl>, `2513_231_hi` <dbl>, `2514_231_lo` <dbl>
#select columns with "lo" and calculate the mean
#?contains
data_lo <- select(data, contains("lo"))
head(data_lo)
## # A tibble: 6 × 12
##   N8_lo_1 N8_lo_2 N8_lo_3 N8_lo_4 N8_lo_5 N8_lo_6 `2021_159_lo` `2052_159_lo`
##     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>         <dbl>         <dbl>
## 1  12383.  11561.  12654.  12025.  11570.  12584.         4041.         6757.
## 2   3746.   3334.   3474.   3485.   3196.   3505.         1114.         1672.
## 3   3926.   3887.   4368.   3674.   3862.   4361.         1269.         1512.
## 4  24078.  18815.  21059.  23859.  18633.  21224.         7659.         5312.
## 5  10581.  10733.  11364.  10265.  10857.  11163.         3630.         2144.
## 6  26545.  23818.  27669.  26145.  23917.  27725.         3967.         3550 
## # ℹ 4 more variables: `2064_159_lo` <dbl>, `2479_231_lo` <dbl>,
## #   `2499_231_lo` <dbl>, `2514_231_lo` <dbl>
data_lo <- select(data, matches('lo|Gene'))
head(data_lo)
## # A tibble: 6 × 13
##   Gene    N8_lo_1 N8_lo_2 N8_lo_3 N8_lo_4 N8_lo_5 N8_lo_6 `2021_159_lo`
##   <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>         <dbl>
## 1 C1orf43  12383.  11561.  12654.  12025.  11570.  12584.         4041.
## 2 CHMP2A    3746.   3334.   3474.   3485.   3196.   3505.         1114.
## 3 EMC7      3926.   3887.   4368.   3674.   3862.   4361.         1269.
## 4 GPI      24078.  18815.  21059.  23859.  18633.  21224.         7659.
## 5 PSMB2    10581.  10733.  11364.  10265.  10857.  11163.         3630.
## 6 PSMB4    26545.  23818.  27669.  26145.  23917.  27725.         3967.
## # ℹ 5 more variables: `2052_159_lo` <dbl>, `2064_159_lo` <dbl>,
## #   `2479_231_lo` <dbl>, `2499_231_lo` <dbl>, `2514_231_lo` <dbl>
data_lo_mean <- select(data_lo,-Gene) %>% mutate(average=rowMeans(.))
data_lo_mean
## # A tibble: 11 × 13
##    N8_lo_1 N8_lo_2 N8_lo_3 N8_lo_4 N8_lo_5 N8_lo_6 `2021_159_lo` `2052_159_lo`
##      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>         <dbl>         <dbl>
##  1  12383.  11561.  12654.  12025.  11570.  12584.         4041.         6757.
##  2   3746.   3334.   3474.   3485.   3196.   3505.         1114.         1672.
##  3   3926.   3887.   4368.   3674.   3862.   4361.         1269.         1512.
##  4  24078.  18815.  21059.  23859.  18633.  21224.         7659.         5312.
##  5  10581.  10733.  11364.  10265.  10857.  11163.         3630.         2144.
##  6  26545.  23818.  27669.  26145.  23917.  27725.         3967.         3550 
##  7   4207.   4423.   4262.   4037.   4545.   4282.         5842.         4290.
##  8   5427.   5190.   5698.   5552.   5079.   5800.         2484.         2056.
##  9   1280.   1442.   1244.   1262.   1430.   1189.         1995.         1151.
## 10  20143.  19895.  21588.  20532.  19714.  21545.         9019.         3533.
## 11   1896.   2487.   2181.   1913.   2501.   2215.         1320.          620.
## # ℹ 5 more variables: `2064_159_lo` <dbl>, `2479_231_lo` <dbl>,
## #   `2499_231_lo` <dbl>, `2514_231_lo` <dbl>, average <dbl>

dplyr has function group_by and summarise which can be very helpful in grouping/summarising data

#Gather data to make it 'tidy'
data2<-data %>% gather(Sample, Value, HMLE_1:`2514_231_lo`)
knitr::kable(data)
Gene HMLE_1 HMLE_2 HMLE_3 N8_1 N8_2 N8_3 N8_lo_1 N8_lo_2 N8_lo_3 N8_lo_4 N8_lo_5 N8_lo_6 N8_hi_1 N8_hi_2 N8_hi_3 N8_hi_4 N8_hi_5 N8_hi_6 2020_159_hi 2021_159_lo 2051_159_hi 2052_159_lo 2064_159_lo 2065_159_hi 2479_231_lo 2480_231_hi 2499_231_lo 2500_231_hi 2513_231_hi 2514_231_lo
C1orf43 2638.63 2637.00 2139.90 1974.41 2324.54 2720.88 12383.21 11561.25 12653.96 12025.11 11569.77 12584.14 10748.37 11580.84 11820.45 10911.87 11559.48 12014.27 4721.38 4040.94 6221.58 6756.85 6799.84 6313.16 12554.60 9495.91 10355.72 9697.18 7085.38 14096.89
CHMP2A 396.12 374.64 308.01 385.71 390.24 401.35 3745.64 3333.62 3474.17 3485.49 3195.98 3505.24 3250.63 3447.26 3273.25 3254.42 3453.26 3220.14 1152.03 1114.13 1539.93 1671.94 1653.83 1525.45 3257.41 4617.31 2688.81 4625.35 2265.31 3231.16
EMC7 627.20 659.25 497.97 670.10 939.40 624.18 3926.17 3886.75 4368.18 3673.52 3861.96 4360.74 4224.40 4401.20 4717.53 4091.59 4324.93 4737.03 1036.89 1268.55 1031.23 1512.44 1301.88 1011.79 1932.92 1911.13 1600.03 1997.40 1689.02 1917.48
GPI 5739.40 5868.39 7034.34 5778.32 4972.38 5461.45 24077.59 18815.02 21058.83 23858.70 18632.68 21223.80 20728.80 14331.97 19301.18 21043.71 14461.49 19351.92 7651.48 7659.03 6225.22 5312.36 5250.23 5892.63 20824.48 19719.69 11163.31 17936.13 9238.90 16378.83
PSMB2 2770.12 3052.46 2986.46 2989.33 2574.36 2112.81 10581.38 10733.41 11363.88 10264.95 10856.66 11163.09 9971.91 10309.92 10846.72 10176.13 10261.09 11339.53 2992.39 3630.22 1982.40 2143.70 2273.16 2088.62 8792.90 9162.77 6029.00 9473.94 6233.50 8740.18
PSMB4 2843.84 3278.91 2797.41 3216.03 3302.58 2677.03 26545.47 23817.93 27668.51 26144.64 23917.14 27724.72 22598.26 24148.99 26886.67 22290.90 23938.99 26663.68 3796.48 3967.37 3411.72 3550.00 3617.75 3177.27 20726.20 18808.92 12314.58 18363.02 10926.18 19297.21
RAB7A 3471.04 3693.25 4161.31 3209.67 3022.29 2965.63 4206.61 4422.55 4261.99 4037.39 4545.31 4281.79 3386.90 4530.12 3187.38 3590.15 4400.86 3339.22 7583.52 5842.31 6789.95 4289.65 3540.02 5098.12 3665.75 4083.54 4963.89 4437.36 4479.49 5209.84
REEP5 2844.94 2601.21 2491.71 2723.11 3065.28 2949.52 5426.53 5189.75 5698.44 5552.06 5078.59 5800.45 4905.75 5547.76 5747.94 5083.06 5620.29 5901.87 1449.00 2483.75 1165.14 2055.53 1353.13 1071.66 3824.88 3912.77 2905.62 3903.41 2800.87 4275.58
SNRPD3 1580.10 1821.74 1493.00 1463.32 1213.16 642.52 1279.51 1441.60 1244.16 1262.23 1429.94 1188.51 1006.42 1501.44 1169.51 1062.90 1479.10 1104.44 1537.09 1994.60 1179.69 1150.76 800.43 914.23 654.06 684.94 782.03 731.07 1493.16 787.39
VCP 8162.37 7444.10 8229.02 7667.78 8802.89 7981.43 20142.66 19895.29 21588.35 20531.65 19714.13 21545.33 22224.94 19112.33 19903.97 22435.19 19222.26 20376.98 6020.01 9018.95 3701.36 3532.59 4122.61 3993.95 25282.36 23514.88 18529.14 23856.52 14987.10 19691.58
VPS29 2404.81 3198.40 1956.39 1900.82 2673.96 1332.03 1896.48 2487.22 2181.22 1913.36 2501.47 2214.82 1797.07 2789.05 2287.50 1668.64 2709.16 2210.61 806.61 1320.30 566.92 620.03 392.96 381.36 2130.65 1242.98 2666.09 1359.02 1741.01 2658.64
knitr::kable(data2)
Gene Sample Value
C1orf43 HMLE_1 2638.63
CHMP2A HMLE_1 396.12
EMC7 HMLE_1 627.20
GPI HMLE_1 5739.40
PSMB2 HMLE_1 2770.12
PSMB4 HMLE_1 2843.84
RAB7A HMLE_1 3471.04
REEP5 HMLE_1 2844.94
SNRPD3 HMLE_1 1580.10
VCP HMLE_1 8162.37
VPS29 HMLE_1 2404.81
C1orf43 HMLE_2 2637.00
CHMP2A HMLE_2 374.64
EMC7 HMLE_2 659.25
GPI HMLE_2 5868.39
PSMB2 HMLE_2 3052.46
PSMB4 HMLE_2 3278.91
RAB7A HMLE_2 3693.25
REEP5 HMLE_2 2601.21
SNRPD3 HMLE_2 1821.74
VCP HMLE_2 7444.10
VPS29 HMLE_2 3198.40
C1orf43 HMLE_3 2139.90
CHMP2A HMLE_3 308.01
EMC7 HMLE_3 497.97
GPI HMLE_3 7034.34
PSMB2 HMLE_3 2986.46
PSMB4 HMLE_3 2797.41
RAB7A HMLE_3 4161.31
REEP5 HMLE_3 2491.71
SNRPD3 HMLE_3 1493.00
VCP HMLE_3 8229.02
VPS29 HMLE_3 1956.39
C1orf43 N8_1 1974.41
CHMP2A N8_1 385.71
EMC7 N8_1 670.10
GPI N8_1 5778.32
PSMB2 N8_1 2989.33
PSMB4 N8_1 3216.03
RAB7A N8_1 3209.67
REEP5 N8_1 2723.11
SNRPD3 N8_1 1463.32
VCP N8_1 7667.78
VPS29 N8_1 1900.82
C1orf43 N8_2 2324.54
CHMP2A N8_2 390.24
EMC7 N8_2 939.40
GPI N8_2 4972.38
PSMB2 N8_2 2574.36
PSMB4 N8_2 3302.58
RAB7A N8_2 3022.29
REEP5 N8_2 3065.28
SNRPD3 N8_2 1213.16
VCP N8_2 8802.89
VPS29 N8_2 2673.96
C1orf43 N8_3 2720.88
CHMP2A N8_3 401.35
EMC7 N8_3 624.18
GPI N8_3 5461.45
PSMB2 N8_3 2112.81
PSMB4 N8_3 2677.03
RAB7A N8_3 2965.63
REEP5 N8_3 2949.52
SNRPD3 N8_3 642.52
VCP N8_3 7981.43
VPS29 N8_3 1332.03
C1orf43 N8_lo_1 12383.21
CHMP2A N8_lo_1 3745.64
EMC7 N8_lo_1 3926.17
GPI N8_lo_1 24077.59
PSMB2 N8_lo_1 10581.38
PSMB4 N8_lo_1 26545.47
RAB7A N8_lo_1 4206.61
REEP5 N8_lo_1 5426.53
SNRPD3 N8_lo_1 1279.51
VCP N8_lo_1 20142.66
VPS29 N8_lo_1 1896.48
C1orf43 N8_lo_2 11561.25
CHMP2A N8_lo_2 3333.62
EMC7 N8_lo_2 3886.75
GPI N8_lo_2 18815.02
PSMB2 N8_lo_2 10733.41
PSMB4 N8_lo_2 23817.93
RAB7A N8_lo_2 4422.55
REEP5 N8_lo_2 5189.75
SNRPD3 N8_lo_2 1441.60
VCP N8_lo_2 19895.29
VPS29 N8_lo_2 2487.22
C1orf43 N8_lo_3 12653.96
CHMP2A N8_lo_3 3474.17
EMC7 N8_lo_3 4368.18
GPI N8_lo_3 21058.83
PSMB2 N8_lo_3 11363.88
PSMB4 N8_lo_3 27668.51
RAB7A N8_lo_3 4261.99
REEP5 N8_lo_3 5698.44
SNRPD3 N8_lo_3 1244.16
VCP N8_lo_3 21588.35
VPS29 N8_lo_3 2181.22
C1orf43 N8_lo_4 12025.11
CHMP2A N8_lo_4 3485.49
EMC7 N8_lo_4 3673.52
GPI N8_lo_4 23858.70
PSMB2 N8_lo_4 10264.95
PSMB4 N8_lo_4 26144.64
RAB7A N8_lo_4 4037.39
REEP5 N8_lo_4 5552.06
SNRPD3 N8_lo_4 1262.23
VCP N8_lo_4 20531.65
VPS29 N8_lo_4 1913.36
C1orf43 N8_lo_5 11569.77
CHMP2A N8_lo_5 3195.98
EMC7 N8_lo_5 3861.96
GPI N8_lo_5 18632.68
PSMB2 N8_lo_5 10856.66
PSMB4 N8_lo_5 23917.14
RAB7A N8_lo_5 4545.31
REEP5 N8_lo_5 5078.59
SNRPD3 N8_lo_5 1429.94
VCP N8_lo_5 19714.13
VPS29 N8_lo_5 2501.47
C1orf43 N8_lo_6 12584.14
CHMP2A N8_lo_6 3505.24
EMC7 N8_lo_6 4360.74
GPI N8_lo_6 21223.80
PSMB2 N8_lo_6 11163.09
PSMB4 N8_lo_6 27724.72
RAB7A N8_lo_6 4281.79
REEP5 N8_lo_6 5800.45
SNRPD3 N8_lo_6 1188.51
VCP N8_lo_6 21545.33
VPS29 N8_lo_6 2214.82
C1orf43 N8_hi_1 10748.37
CHMP2A N8_hi_1 3250.63
EMC7 N8_hi_1 4224.40
GPI N8_hi_1 20728.80
PSMB2 N8_hi_1 9971.91
PSMB4 N8_hi_1 22598.26
RAB7A N8_hi_1 3386.90
REEP5 N8_hi_1 4905.75
SNRPD3 N8_hi_1 1006.42
VCP N8_hi_1 22224.94
VPS29 N8_hi_1 1797.07
C1orf43 N8_hi_2 11580.84
CHMP2A N8_hi_2 3447.26
EMC7 N8_hi_2 4401.20
GPI N8_hi_2 14331.97
PSMB2 N8_hi_2 10309.92
PSMB4 N8_hi_2 24148.99
RAB7A N8_hi_2 4530.12
REEP5 N8_hi_2 5547.76
SNRPD3 N8_hi_2 1501.44
VCP N8_hi_2 19112.33
VPS29 N8_hi_2 2789.05
C1orf43 N8_hi_3 11820.45
CHMP2A N8_hi_3 3273.25
EMC7 N8_hi_3 4717.53
GPI N8_hi_3 19301.18
PSMB2 N8_hi_3 10846.72
PSMB4 N8_hi_3 26886.67
RAB7A N8_hi_3 3187.38
REEP5 N8_hi_3 5747.94
SNRPD3 N8_hi_3 1169.51
VCP N8_hi_3 19903.97
VPS29 N8_hi_3 2287.50
C1orf43 N8_hi_4 10911.87
CHMP2A N8_hi_4 3254.42
EMC7 N8_hi_4 4091.59
GPI N8_hi_4 21043.71
PSMB2 N8_hi_4 10176.13
PSMB4 N8_hi_4 22290.90
RAB7A N8_hi_4 3590.15
REEP5 N8_hi_4 5083.06
SNRPD3 N8_hi_4 1062.90
VCP N8_hi_4 22435.19
VPS29 N8_hi_4 1668.64
C1orf43 N8_hi_5 11559.48
CHMP2A N8_hi_5 3453.26
EMC7 N8_hi_5 4324.93
GPI N8_hi_5 14461.49
PSMB2 N8_hi_5 10261.09
PSMB4 N8_hi_5 23938.99
RAB7A N8_hi_5 4400.86
REEP5 N8_hi_5 5620.29
SNRPD3 N8_hi_5 1479.10
VCP N8_hi_5 19222.26
VPS29 N8_hi_5 2709.16
C1orf43 N8_hi_6 12014.27
CHMP2A N8_hi_6 3220.14
EMC7 N8_hi_6 4737.03
GPI N8_hi_6 19351.92
PSMB2 N8_hi_6 11339.53
PSMB4 N8_hi_6 26663.68
RAB7A N8_hi_6 3339.22
REEP5 N8_hi_6 5901.87
SNRPD3 N8_hi_6 1104.44
VCP N8_hi_6 20376.98
VPS29 N8_hi_6 2210.61
C1orf43 2020_159_hi 4721.38
CHMP2A 2020_159_hi 1152.03
EMC7 2020_159_hi 1036.89
GPI 2020_159_hi 7651.48
PSMB2 2020_159_hi 2992.39
PSMB4 2020_159_hi 3796.48
RAB7A 2020_159_hi 7583.52
REEP5 2020_159_hi 1449.00
SNRPD3 2020_159_hi 1537.09
VCP 2020_159_hi 6020.01
VPS29 2020_159_hi 806.61
C1orf43 2021_159_lo 4040.94
CHMP2A 2021_159_lo 1114.13
EMC7 2021_159_lo 1268.55
GPI 2021_159_lo 7659.03
PSMB2 2021_159_lo 3630.22
PSMB4 2021_159_lo 3967.37
RAB7A 2021_159_lo 5842.31
REEP5 2021_159_lo 2483.75
SNRPD3 2021_159_lo 1994.60
VCP 2021_159_lo 9018.95
VPS29 2021_159_lo 1320.30
C1orf43 2051_159_hi 6221.58
CHMP2A 2051_159_hi 1539.93
EMC7 2051_159_hi 1031.23
GPI 2051_159_hi 6225.22
PSMB2 2051_159_hi 1982.40
PSMB4 2051_159_hi 3411.72
RAB7A 2051_159_hi 6789.95
REEP5 2051_159_hi 1165.14
SNRPD3 2051_159_hi 1179.69
VCP 2051_159_hi 3701.36
VPS29 2051_159_hi 566.92
C1orf43 2052_159_lo 6756.85
CHMP2A 2052_159_lo 1671.94
EMC7 2052_159_lo 1512.44
GPI 2052_159_lo 5312.36
PSMB2 2052_159_lo 2143.70
PSMB4 2052_159_lo 3550.00
RAB7A 2052_159_lo 4289.65
REEP5 2052_159_lo 2055.53
SNRPD3 2052_159_lo 1150.76
VCP 2052_159_lo 3532.59
VPS29 2052_159_lo 620.03
C1orf43 2064_159_lo 6799.84
CHMP2A 2064_159_lo 1653.83
EMC7 2064_159_lo 1301.88
GPI 2064_159_lo 5250.23
PSMB2 2064_159_lo 2273.16
PSMB4 2064_159_lo 3617.75
RAB7A 2064_159_lo 3540.02
REEP5 2064_159_lo 1353.13
SNRPD3 2064_159_lo 800.43
VCP 2064_159_lo 4122.61
VPS29 2064_159_lo 392.96
C1orf43 2065_159_hi 6313.16
CHMP2A 2065_159_hi 1525.45
EMC7 2065_159_hi 1011.79
GPI 2065_159_hi 5892.63
PSMB2 2065_159_hi 2088.62
PSMB4 2065_159_hi 3177.27
RAB7A 2065_159_hi 5098.12
REEP5 2065_159_hi 1071.66
SNRPD3 2065_159_hi 914.23
VCP 2065_159_hi 3993.95
VPS29 2065_159_hi 381.36
C1orf43 2479_231_lo 12554.60
CHMP2A 2479_231_lo 3257.41
EMC7 2479_231_lo 1932.92
GPI 2479_231_lo 20824.48
PSMB2 2479_231_lo 8792.90
PSMB4 2479_231_lo 20726.20
RAB7A 2479_231_lo 3665.75
REEP5 2479_231_lo 3824.88
SNRPD3 2479_231_lo 654.06
VCP 2479_231_lo 25282.36
VPS29 2479_231_lo 2130.65
C1orf43 2480_231_hi 9495.91
CHMP2A 2480_231_hi 4617.31
EMC7 2480_231_hi 1911.13
GPI 2480_231_hi 19719.69
PSMB2 2480_231_hi 9162.77
PSMB4 2480_231_hi 18808.92
RAB7A 2480_231_hi 4083.54
REEP5 2480_231_hi 3912.77
SNRPD3 2480_231_hi 684.94
VCP 2480_231_hi 23514.88
VPS29 2480_231_hi 1242.98
C1orf43 2499_231_lo 10355.72
CHMP2A 2499_231_lo 2688.81
EMC7 2499_231_lo 1600.03
GPI 2499_231_lo 11163.31
PSMB2 2499_231_lo 6029.00
PSMB4 2499_231_lo 12314.58
RAB7A 2499_231_lo 4963.89
REEP5 2499_231_lo 2905.62
SNRPD3 2499_231_lo 782.03
VCP 2499_231_lo 18529.14
VPS29 2499_231_lo 2666.09
C1orf43 2500_231_hi 9697.18
CHMP2A 2500_231_hi 4625.35
EMC7 2500_231_hi 1997.40
GPI 2500_231_hi 17936.13
PSMB2 2500_231_hi 9473.94
PSMB4 2500_231_hi 18363.02
RAB7A 2500_231_hi 4437.36
REEP5 2500_231_hi 3903.41
SNRPD3 2500_231_hi 731.07
VCP 2500_231_hi 23856.52
VPS29 2500_231_hi 1359.02
C1orf43 2513_231_hi 7085.38
CHMP2A 2513_231_hi 2265.31
EMC7 2513_231_hi 1689.02
GPI 2513_231_hi 9238.90
PSMB2 2513_231_hi 6233.50
PSMB4 2513_231_hi 10926.18
RAB7A 2513_231_hi 4479.49
REEP5 2513_231_hi 2800.87
SNRPD3 2513_231_hi 1493.16
VCP 2513_231_hi 14987.10
VPS29 2513_231_hi 1741.01
C1orf43 2514_231_lo 14096.89
CHMP2A 2514_231_lo 3231.16
EMC7 2514_231_lo 1917.48
GPI 2514_231_lo 16378.83
PSMB2 2514_231_lo 8740.18
PSMB4 2514_231_lo 19297.21
RAB7A 2514_231_lo 5209.84
REEP5 2514_231_lo 4275.58
SNRPD3 2514_231_lo 787.39
VCP 2514_231_lo 19691.58
VPS29 2514_231_lo 2658.64
#select values greater than 10000
data2_genesHigh <- filter(data2, Value > 10000)
data2_genesHigh %>% select(Gene) %>% distinct()
## # A tibble: 5 × 1
##   Gene   
##   <chr>  
## 1 C1orf43
## 2 GPI    
## 3 PSMB2  
## 4 PSMB4  
## 5 VCP
#Spread (opposite of gather)
data2_genesHighSpread <- spread(data2_genesHigh,Sample,Value)
data2_genesHighSpread
## # A tibble: 5 × 19
##   Gene    `2479_231_lo` `2480_231_hi` `2499_231_lo` `2500_231_hi` `2513_231_hi`
##   <chr>           <dbl>         <dbl>         <dbl>         <dbl>         <dbl>
## 1 C1orf43        12555.           NA         10356.           NA            NA 
## 2 GPI            20824.        19720.        11163.        17936.           NA 
## 3 PSMB2             NA            NA            NA            NA            NA 
## 4 PSMB4          20726.        18809.        12315.        18363.        10926.
## 5 VCP            25282.        23515.        18529.        23857.        14987.
## # ℹ 13 more variables: `2514_231_lo` <dbl>, N8_hi_1 <dbl>, N8_hi_2 <dbl>,
## #   N8_hi_3 <dbl>, N8_hi_4 <dbl>, N8_hi_5 <dbl>, N8_hi_6 <dbl>, N8_lo_1 <dbl>,
## #   N8_lo_2 <dbl>, N8_lo_3 <dbl>, N8_lo_4 <dbl>, N8_lo_5 <dbl>, N8_lo_6 <dbl>
#group_by and summarise
data2_genesMean <- data2 %>% group_by(Gene) %>% summarise(average=mean(Value))
data2_genesMean
## # A tibble: 11 × 2
##    Gene    average
##    <chr>     <dbl>
##  1 C1orf43   8466.
##  2 CHMP2A    2408.
##  3 EMC7      2427.
##  4 GPI      13500.
##  5 PSMB2     6930.
##  6 PSMB4    14747.
##  7 RAB7A     4290.
##  8 REEP5     3781.
##  9 SNRPD3    1203.
## 10 VCP      15041.
## 11 VPS29     1867.

ggplot2 is a powerful popular graphics package that can be easily used with other tidyverse packages

library(ggplot2)

data2 %>% ggplot(aes(x=factor(Gene),y=Value))+geom_boxplot()+theme(axis.text.x = element_text(angle = 90, hjust = 1))

data2 %>% filter(Sample %in% c("HMLE_1","HMLE_2")) %>% ggplot(aes(x=factor(Gene),y=Value))+geom_point()+facet_grid(. ~ Sample)+theme(axis.text.x = element_text(angle = 90, hjust = 1)) + geom_line ()
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?