Adapted from https://www.broadinstitute.org/center-cell-circuits-computational-genomics-workshop by Inma Barrasa, BaRC, Whitehead Institute for Biomedical Research.
Follow these steps:
# setwd("/nfs/BaRC_training")
# dir.create("userName") #Use your userName (eg. jdoe)
File -> Open File… -> “SingleCell_Seurat_2021.Rmd”
use “Save As…” to save the Rmd to your directory/folder #cd to your folder # setwd(“/nfs/BaRC_training/userName”)
To give you experience with the analysis of single cell RNA sequencing (scRNA-seq) including performing quality control and identifying cell type subsets.
To introduce you to scRNA-seq analysis using the Seurat package.
We will be using the Seurat version 3. Commands are a bit different to Seurat v2. See https://satijalab.org/seurat/essential_commands.html for more detail.
Data produced in a single cell RNA-seq experiment has several interesting characteristics that make it distinct from data produced in a bulk population RNA-seq experiment. Two characteristics that are important to keep in mind when working with scRNA-Seq are drop-out (the excessive amount of zeros due to limiting mRNA) and the potential for quality control (QC) metrics to be confounded with biology. This combined with the ability to measure heterogeniety from cells in samples has shifted the field away from the typical analysis in population-based RNA-Seq. Here we demonstrate some approaches to quality control, followed by identifying and analyzing cell subsets.
For this tutorial, we will be analyzing the a dataset of Non-Small Cell Lung Cancer Cells (NSCLC) freely available from 10X Genomics (https://support.10xgenomics.com/single-cell-vdj/datasets/2.2.0/vdj_v1_hs_nsclc_5gex), using the Seurat R package (http://satijalab.org/seurat/), a popular and powerful set of tools to conduct scRNA-seq analysis in R. In this dataset, there are 7802 single cells that were sequenced on the Illumina NovaSeq 6000. Please note this tutorial borrows heavily from Seurat’s tutorials (https://satijalab.org/seurat/vignettes.html), so feel free to go through them in more detail.
When loading libraries, we are asking R to load code for us written by someone else. It is a convenient way to leverage and reproduce methodology developed by others.
library(Seurat)
library(dplyr)
library(Matrix)
library(gdata)
library(reshape)
library(ggplot2)
packageDescription("Seurat")$Version
## [1] "3.2.2"
The data for Non-Small Cell Lung Cancer Cells (NSCLC) is freely available from 10X Genomics (https://support.10xgenomics.com/single-cell-vdj/datasets/2.2.0/vdj_v1_hs_nsclc_5gex). We start by reading in the counts matrix generated by the Cell Ranger count program.
counts_matrix_filename = "/nfs/BaRC_training/SingleCellRNA-seq/filtered_gene_bc_matrices/GRCh38/"
counts <- Read10X(data.dir = counts_matrix_filename) # Seurat function to read in 10x count data
counts[1:10, 1:3]
## 10 x 3 sparse Matrix of class "dgCMatrix"
## AAACCTGAGCTAGTCT-1 AAACCTGAGGGCACTA-1 AAACCTGAGTACGTTC-1
## RP11-34P13.3 . . .
## FAM138A . . .
## OR4F5 . . .
## RP11-34P13.7 . . .
## RP11-34P13.8 . . .
## RP11-34P13.14 . . .
## RP11-34P13.9 . . .
## FO538757.3 . . .
## FO538757.2 . . .
## AP006222.2 . . .
Here we see the upper left corner of the sparse matrix. The columns are indexed by 10x cell barcodes (each 16 nt long), and the rows are the gene names. We mentioned these matrices are sparse, here we see only zeroes (indicated by the “.” symbol); this is the most common value in these sparse matrices. Next, let us look at the dimensions of this matrix.
dim(counts) # report number of genes (rows) and number of cells (columns)
## [1] 33694 7802
Here we see the counts matrix has 33694 genes and 7802 cells.
object.size(counts) # size in bytes
## [1] 169457000 bytes
object.size(as.matrix(counts)) # size in bytes
## [1] 2106010424 bytes
We see here that the sparse matrix takes 170 Mb in memory while storing the matrix in a dense format (where all count values including zeros are stored) takes 10 times as much memory! This memory saving is very important, especially as data sets are now being created that are beyond a million cells. These matrices can become unmanageable without special computing resources.
In the sparse representation, we assume that the majority of count values in a matrix are zero. We only store the non-zero values. This is implemented in the Matrix package using a dgTMatrix object.
You can learn a lot about your scRNA-seq data’s quality with simple plotting. Let’s do some plotting to look at the number of reads per cell, reads per genes, expressed genes per cell (often called complexity), and rarity of genes (cells expressing genes).
counts_per_cell <- Matrix::colSums(counts)
counts_per_gene <- Matrix::rowSums(counts)
genes_per_cell <- Matrix::colSums(counts>0) # count a gene only if it has non-zero reads mapped.
cells_per_gene <- Matrix::rowSums(counts>0) # only count cells where the gene is expressed
colSums and rowSums are functions that work on each row or column in a matrix and return the column sums or row sums as a vector. If this is true counts_per_cell should have 1 entry per cell. Let’s make sure the length of the returned vector matches the matrix dimension for column. How would you do that? ( Hint:length() ).
Notes: 1. Matrix::colSums is a way to force functions from the Matrix library to be used. There are many libraries that implement colSums, we are forcing the one from the Matrix library to be used here to make sure it handles the dgTmatrix (sparse matrix) correctly. This is good practice.
hist(log10(counts_per_cell+1),main='counts per cell',col='blue')
hist(log10(genes_per_cell+1), main='genes per cell', col='blue')
plot(counts_per_cell, genes_per_cell, log='xy', col='blue')
title('counts vs genes per cell')
hist(log10(counts_per_gene+1),main='counts per gene',col='blue')
Here we see examples of plotting a new plot, the histogram. R makes this really easy with the hist function. We are also transforming the values to log10 before plotting, this is done with the log10 method. When logging count data, the + 1 is used to avoid log10(0) which is not defined.
Here we rank each cell by its library complexity, ie the number of genes detected per cell. This is a very useful plot as it shows the distribution of library complexity in the sequencing run. One can use this plot to investigate observations (potential cells) that are actually failed libraries (lower end outliers) or observations that are cell doublets (higher end outliers).
plot(sort(genes_per_cell), xlab='cell', log='y', main='genes per cell (ordered)')
Notes: 1. Here we use a simple scatter plot, but we sort the library (cell) complexity before plotting. y here is the library complexity, and x is the cells ranked by complexity.
We do not want to be working with observations (cells) that are poor libraries or resulting from doublet cells so we are going to filter on both ends of the distribution. We can see inflection points around 350 and 5000. We will plot lines to view these values. Feel free to experiment with cutoffs and to plot them to see the results.
# set upper and lower thresholds for genes per cell:
MIN_GENES_PER_CELL <- 350 ## user-defined setting
MAX_GENES_PER_CELL <- 5000 ## user-defined setting
# now replot with the thresholds being shown:
plot(sort(genes_per_cell), xlab='cell', log='y', main='genes per cell (ordered)')
abline(h=MIN_GENES_PER_CELL, col='magenta') # lower threshold
abline(h=MAX_GENES_PER_CELL, col='gold') # upper threshold
These cutoffs should not be viewed as very exact but instead should be viewed as something you have a general range to select within.
The percent of reads in a cell coming from the mitochondrial genome is often used to measure the quality of cells. High fraction of mitochondrial counts can be indicative of cells whose cytoplasmic mRNA has leaked out through a broken membrane, and thus, only mRNA located in the mitochondria is still conserved. This is often attributed to cells getting damaged during tissue dissociation or during flow cytometry sorting. However, the specific biology of your system can also result in high mitochondrial reads, for example if you are studying cardiomyocytes. Therefore we must always keep the biology in mind when deciding how to implement quality control metrics. Below, we measure the percent of gene expression attributed to mitochondrial genes.
# define the mitochondrial genes
mito_genes <- grep("^mt-", rownames(counts) , ignore.case=T, value=T)
print(mito_genes)
## [1] "MT-ND1" "MT-ND2" "MT-CO1" "MT-CO2" "MT-ATP8" "MT-ATP6" "MT-CO3"
## [8] "MT-ND3" "MT-ND4L" "MT-ND4" "MT-ND5" "MT-ND6" "MT-CYB"
First we identify mitochondrial genes from the counts matrix. But how did we do this? We used the function “grep” which is a classic unix commandline tool for searching for substrings. It is so useful R has implemented this function. Please take a moment to read grep’s description. So here we took all the rownames from the matrix and searched through each row name for “^mt-” (“^” indicates beginning of the word followed by the letters “m” then “t” then “-”). This trick only works because we used a reference transcript (gtf file) that used this naming convention. This is a common trick to identify mitochondrial genes.
Feel free to try to search for other genes or gene families, make sure NOT to store the result as “mito_genes”. Please use another variable so the rest of the tutorial is not affected.
# compute pct mito
mito_gene_read_counts = Matrix::colSums(counts[mito_genes,])
pct_mito = mito_gene_read_counts / counts_per_cell * 100
plot(sort(pct_mito), xlab = "cells sorted by percentage mitochondrial counts", ylab = "percentage mitochondrial counts")
Now that we have the mitochondrial genes, we select them from the matrix and calculate the total counts in each cell coming from those genes. We then take those results and the fraction of those mitochondrial counts in the relation to all counts to calculate a percentage of mitochondrial counts.
We plot them as sorted scatter plots. In this case only the upper end of the distribution is filtered as we are concerned with unusually high percent mitochondrial counts.
MAX_PCT_MITO <- 10 ## user-defined setting
plot(sort(pct_mito))
abline(h=MAX_PCT_MITO, col='red')
Here we plot the percent mitochondrial read cutoff to visualize the filtering point.
One very popular product for 3’ single cell transcriptomics sequencing is the 10X assay. Sequencing data from 10X can be transformed to gene count matrices using the Cell Ranger pipeline. Here we go through many of the steps we saw earlier but using the Seurat R library. The Seurat R library makes available many useful functionalities. We will be using the same data as before.
counts <- Read10X(data.dir = counts_matrix_filename)
Seurat works with the count matrix in the sparse format, which is far more memory efficient as we observed earlier.
seurat <- CreateSeuratObject(counts, min.cells = 3, min.features = 350, project = "10X_NSCLC")
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
Almost all our analysis will be on the single object, of class Seurat, created above. This object contains various “slots” that will store not only the raw count data, but also the results from various computations below. This has the advantage that we do not need to keep track of inidividual variables of interest - they can all be collapsed into a single object as long as these slots are pre-defined. This object was made with the CreateSeuratObject(). Here we are filtering out genes that are are expressed in 2 or fewer cells and are filtering cells with complexity less than 350 genes.
The “GetAssayData” function allows us to get the counts/raw data stored on the “counts” slot. As we proceed on our analysis other slots like “data” (normalized data) and “scale.data” (data scaled by gene) will be populated.
GetAssayData(object = seurat, slot = "counts")[1:10,1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
## [[ suppressing 10 column names 'AAACCTGAGCTAGTCT-1', 'AAACCTGAGGGCACTA-1', 'AAACCTGAGTACGTTC-1' ... ]]
##
## FO538757.2 . . . . 1 2 . . . .
## AP006222.2 . . . . . . . . . .
## RP4-669L17.10 . . . . . . . . . .
## RP11-206L10.9 . . . . . . . . . .
## LINC00115 . . . . . . . . . .
## FAM41C . . . . . . . . . .
## SAMD11 . . . . . 1 . . . .
## NOC2L . . . . 2 4 . 1 . .
## KLHL17 . . . . . . . . . .
## PLEKHN1 . . . . . . . . . .
The Seurat object initialization step above only considered cells that expressed at least 350 genes. Additionally, we would like to exclude cells that are damaged. A common metric to judge this (although by no means the only one) is the relative expression of mitochondrially derived genes.
Here we calculate the percent mitochondrial reads and add it to the Seurat object in the slot named meta.data, using the function “PercentageFeatureSet”. Then we plot the mit percentage using the violin plot function provided by Seurat.
# The number of genes and UMIs (nFeature_RNA and nCount_RNA) are automatically calculated for every object by Seurat.
# this is step adds the % mit content to the metadata
seurat[["percent.mt"]] <- PercentageFeatureSet(seurat, pattern = "^MT-")
head(seurat@meta.data)
## orig.ident nCount_RNA nFeature_RNA percent.mt
## AAACCTGAGCTAGTCT-1 10X_NSCLC 3605 1184 20.832178
## AAACCTGAGGGCACTA-1 10X_NSCLC 3828 1387 3.448276
## AAACCTGAGTACGTTC-1 10X_NSCLC 6457 1784 2.865108
## AAACCTGAGTCCGGTC-1 10X_NSCLC 3075 1092 6.276423
## AAACCTGCACCAGGTC-1 10X_NSCLC 9399 2625 3.202468
## AAACCTGCACCTCGTT-1 10X_NSCLC 48838 5602 2.321962
VlnPlot(seurat, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3, pt.size = 0.1) + NoLegend()
# FeatureScatter is typically used to visualize feature-feature relationships, but can be used
# for anything calculated by the object, i.e. columns in object metadata, PC scores etc.
plot1 <- FeatureScatter(seurat, feature1 = "nCount_RNA", feature2 = "percent.mt") + NoLegend()
plot2 <- FeatureScatter(seurat, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") + NoLegend()
CombinePlots(plots = list(plot1, plot2))
## Warning: CombinePlots is being deprecated. Plots should now be combined using
## the patchwork system.
FeatureScatter is typically used to visualize gene-gene relationships, but can be used for anything calculated by the object, i.e. columns in object@meta.data, PC scores etc. Since there is a rare subset of cells with an outlier level of high mitochondrial percentage and also low UMI content, we filter these as well. The title of these plots is the Pearson correlation between the two variables: percent.mito vs nCount_RNA or nFeature_RNA (number of genes) vs nCount_RNA.
str(seurat)
## Formal class 'Seurat' [package "Seurat"] with 13 slots
## ..@ assays :List of 1
## .. ..$ RNA:Formal class 'Assay' [package "Seurat"] with 8 slots
## .. .. .. ..@ counts :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. ..@ i : int [1:13732088] 22 37 100 102 105 132 146 171 177 185 ...
## .. .. .. .. .. ..@ p : int [1:7110] 0 1184 2571 4355 5447 8072 13674 15861 17504 18605 ...
## .. .. .. .. .. ..@ Dim : int [1:2] 20213 7109
## .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. ..$ : chr [1:20213] "FO538757.2" "AP006222.2" "RP4-669L17.10" "RP11-206L10.9" ...
## .. .. .. .. .. .. ..$ : chr [1:7109] "AAACCTGAGCTAGTCT-1" "AAACCTGAGGGCACTA-1" "AAACCTGAGTACGTTC-1" "AAACCTGAGTCCGGTC-1" ...
## .. .. .. .. .. ..@ x : num [1:13732088] 1 2 7 1 1 3 1 1 1 1 ...
## .. .. .. .. .. ..@ factors : list()
## .. .. .. ..@ data :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. ..@ i : int [1:13732088] 22 37 100 102 105 132 146 171 177 185 ...
## .. .. .. .. .. ..@ p : int [1:7110] 0 1184 2571 4355 5447 8072 13674 15861 17504 18605 ...
## .. .. .. .. .. ..@ Dim : int [1:2] 20213 7109
## .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. ..$ : chr [1:20213] "FO538757.2" "AP006222.2" "RP4-669L17.10" "RP11-206L10.9" ...
## .. .. .. .. .. .. ..$ : chr [1:7109] "AAACCTGAGCTAGTCT-1" "AAACCTGAGGGCACTA-1" "AAACCTGAGTACGTTC-1" "AAACCTGAGTCCGGTC-1" ...
## .. .. .. .. .. ..@ x : num [1:13732088] 1 2 7 1 1 3 1 1 1 1 ...
## .. .. .. .. .. ..@ factors : list()
## .. .. .. ..@ scale.data : num[0 , 0 ]
## .. .. .. ..@ key : chr "rna_"
## .. .. .. ..@ assay.orig : NULL
## .. .. .. ..@ var.features : logi(0)
## .. .. .. ..@ meta.features:'data.frame': 20213 obs. of 0 variables
## .. .. .. ..@ misc : NULL
## ..@ meta.data :'data.frame': 7109 obs. of 4 variables:
## .. ..$ orig.ident : Factor w/ 1 level "10X_NSCLC": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ nCount_RNA : num [1:7109] 3605 3828 6457 3075 9399 ...
## .. ..$ nFeature_RNA: int [1:7109] 1184 1387 1784 1092 2625 5602 2187 1643 1101 2717 ...
## .. ..$ percent.mt : num [1:7109] 20.83 3.45 2.87 6.28 3.2 ...
## ..@ active.assay: chr "RNA"
## ..@ active.ident: Factor w/ 1 level "10X_NSCLC": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..- attr(*, "names")= chr [1:7109] "AAACCTGAGCTAGTCT-1" "AAACCTGAGGGCACTA-1" "AAACCTGAGTACGTTC-1" "AAACCTGAGTCCGGTC-1" ...
## ..@ graphs : list()
## ..@ neighbors : list()
## ..@ reductions : list()
## ..@ images : list()
## ..@ project.name: chr "10X_NSCLC"
## ..@ misc : list()
## ..@ version :Classes 'package_version', 'numeric_version' hidden list of 1
## .. ..$ : int [1:3] 3 2 2
## ..@ commands : list()
## ..@ tools : list()
These are the slots in the Seurat object. Some of the slots are automatically updated by Seurat as you move through analysis. Take a moment to look through the information, knowing the slots allow you to leverage work Seurat has already done for you.
VlnPlot(object = seurat, features = c("nFeature_RNA"), group.by = c('orig.ident'), pt.size = 0.1) +NoLegend()
Here we plot the number of genes per cell by what Seurat calls orig.ident. Identity is a concept that is used in the Seurat object to refer to the cell identity. In this case, the cell identity is 10X_NSCLC, but after we cluster the cells, the cell identity will be whatever cluster the cell belongs to. We will see how identity updates as we go throught the analysis.
Next, let’s filter the cells based on the quality control metrics.
seurat <- subset(seurat, subset = nFeature_RNA > 350 & nFeature_RNA < 5000 & percent.mt < 10)
After removing unwanted genes cells from the dataset, the next step is to normalize the data. By default, we employ a global-scaling normalization method “LogNormalize” that normalizes the gene expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result. The log-transformation is a commonly used transformation that has many desirable properties, such as variance stabilization. Seurat uses natural log.
seurat <- NormalizeData(seurat, normalization.method = "LogNormalize", scale.factor = 10000)
Seurat calculates highly variable genes and focuses on these for downstream analysis. FindVariableFeatures
, by default, it returns 2,000 features per dataset. MOre detailed can be found here: https://www.ncbi.nlm.nih.gov/pubmed/31178118
seurat <- FindVariableFeatures(seurat, selection.method = "vst", nfeatures = 2000)
length(VariableFeatures(seurat))
## [1] 2000
# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(seurat), 10)
# plot variable features with and without labels
plot1 <- VariableFeaturePlot(seurat)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
## When using repel, set xnudge and ynudge to 0 for optimal results
#CombinePlots(plots = list(plot1, plot2))
plot1
## Warning: Transformation introduced infinite values in continuous x-axis
plot2
## Warning: Transformation introduced infinite values in continuous x-axis
Next, we apply a linear transformation (‘scaling’) that is a standard pre-processing step prior to dimensional reduction techniques like PCA. The ScaleData function:
Shifts the expression of each gene, so that the mean expression across cells is 0
Scales the expression of each gene, so that the variance across cells is 1
This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate
The results of this are stored in seurat[["RNA"]]@scale.data
all.genes <- rownames(seurat)
seurat <- ScaleData(seurat, features = all.genes)
## Centering and scaling data matrix
seurat <- RunPCA(seurat, features = VariableFeatures(object = seurat))
## PC_ 1
## Positive: KRT7, TACSTD2, SLPI, FXYD3, CLDN7, S100P, MPZL2, PTGES, KRT19, SERINC2
## ELF3, DSP, C19orf33, KRT17, LCN2, DDR1, KLF5, GPRC5A, KRT18, MDK
## S100A14, S100A16, ID1, TRIM29, IFITM3, TMPRSS4, CLDN4, SPINT1, SERPINB3, PERP
## Negative: CD74, HLA-DPA1, HLA-DPB1, HLA-DQA1, HLA-DRB1, HLA-DRA, HERPUD1, IGHA1, TNF, PIM2
## ITM2A, CCL5, CD7, KIAA0125, PLAC8, GZMA, STAG3, KLRB1, AC079767.4, IGKC
## DERL3, CTSW, C12orf75, BATF, JCHAIN, GZMK, CST7, HOPX, HLA-DRB5, TRBC1
## PC_ 2
## Positive: AIF1, LYZ, TYROBP, FCER1G, SERPINA1, FCGR2A, MS4A6A, CD14, CD68, CSF1R
## CPVL, FCGRT, FPR1, MAFB, SPI1, CLEC7A, CD163, RNF130, PILRA, C1orf162
## IGSF6, PLXDC2, CST3, LILRB2, SLC11A1, FTL, NPC2, LST1, MNDA, NCF2
## Negative: RPS18, PERP, AQP3, KRT7, KRT17, CD24, FXYD3, DSP, C19orf33, SLPI
## S100A2, KRT18, PTGES, KLF5, S100P, S100A14, LCN2, S100A16, CDKN2A, ELF3
## TRIM29, KRT19, DDR1, SERPINB3, TACSTD2, GPRC5A, KRT6A, RAB25, MDK, TFAP2A
## PC_ 3
## Positive: TPSD1, CPA3, TPSAB1, TPSB2, MS4A2, GATA2, HPGDS, SLC18A2, KIT, HDC
## RP11-354E11.2, IL1RL1, CLU, VWA5A, MAOB, CNRIP1, RP11-32B5.7, HPGD, LTC4S, RGS13
## SLC45A3, GMPR, NTRK1, GALC, TMEM176B, ENPP3, SDPR, CALB2, CD9, TIMP3
## Negative: S100A8, S100A12, S100A9, SPRR1B, HLA-DRA, PDZK1IP1, NAMPT, CD74, SPRR2A, TMPRSS11D
## HLA-DRB1, RDH10, ADGRF1, ERO1A, SCEL, FCN1, CSTA, PRSS22, F3, HLA-DPA1
## APOBEC3A, MUC4, RAB31, RNF39, VCAN, HLA-DPB1, TMPRSS4, BCL2A1, CLEC4E, ELF3
## PC_ 4
## Positive: DHRS9, CPA3, TPSD1, TPSAB1, SPRR1B, TPSB2, MS4A2, GATA2, HDC, SPRR2A
## HPGDS, SLC18A2, IL1RL1, PDZK1IP1, RDH10, RP11-354E11.2, KIT, ADGRF1, HPGD, SCEL
## CNRIP1, RNF39, VWA5A, MUC4, ERO1A, TMPRSS11D, PRSS22, RGS13, RP11-32B5.7, MAOB
## Negative: SFRP1, MSLN, GAS6, FBLN1, PCSK1, LINC00473, IGFBP4, IGFBP2, FOLR1, IGFBP7
## BCAM, TNS4, SOD3, TUSC3, CPE, RAMP1, ADIRF, ANXA8, RAET1G, EPCAM
## HTRA1, INHA, PON3, RP11-49I11.1, SMOC1, FEZ1, ARG2, NNMT, CAV2, ANXA8L1
## PC_ 5
## Positive: SLC11A1, VCAN, TREM1, MCEMP1, LINC01272, AQP9, CD7, PLIN2, SPP1, C5AR1
## SDC2, ANGPTL4, CXCL8, ANPEP, GZMA, CTSL, UPP1, MMP19, NKG7, CCL5
## THBS1, KLRB1, PLAUR, RGCC, PRF1, CTSW, FCGR3A, CD300E, GPNMB, FCN1
## Negative: HLA-DQA1, HLA-DPB1, HLA-DPA1, HLA-DQB2, CD1E, S100B, HLA-DRB1, HLA-DRB5, HLA-DRA, FCER1A
## PKIB, CD1A, CD74, PLD4, GSN, NDRG2, SERPINF1, CLEC10A, CD1C, APOL4
## KCNMB1, CD207, ENPP2, HLA-DQA2, PAK1, AXL, SERPINF2, CEACAM4, SIGLEC10, P2RY13
print(seurat[["pca"]], dims = 1:5, nfeatures = 5)
## PC_ 1
## Positive: KRT7, TACSTD2, SLPI, FXYD3, CLDN7
## Negative: CD74, HLA-DPA1, HLA-DPB1, HLA-DQA1, HLA-DRB1
## PC_ 2
## Positive: AIF1, LYZ, TYROBP, FCER1G, SERPINA1
## Negative: RPS18, PERP, AQP3, KRT7, KRT17
## PC_ 3
## Positive: TPSD1, CPA3, TPSAB1, TPSB2, MS4A2
## Negative: S100A8, S100A12, S100A9, SPRR1B, HLA-DRA
## PC_ 4
## Positive: DHRS9, CPA3, TPSD1, TPSAB1, SPRR1B
## Negative: SFRP1, MSLN, GAS6, FBLN1, PCSK1
## PC_ 5
## Positive: SLC11A1, VCAN, TREM1, MCEMP1, LINC01272
## Negative: HLA-DQA1, HLA-DPB1, HLA-DPA1, HLA-DQB2, CD1E
VizDimLoadings(seurat, dims = 1:2, reduction = "pca")
Visualize the PCA plot
DimPlot(seurat, reduction = "pca") + NoLegend()
In particular DimHeatmap allows for easy exploration of the primary sources of heterogeneity in a dataset, and can be useful when trying to decide which PCs to include for further downstream analyses. Both cells and features are ordered according to their PCA scores. Setting cells to a number plots the ‘extreme’ cells on both ends of the spectrum, which dramatically speeds plotting for large datasets. Though clearly a supervised analysis, we find this to be a valuable tool for exploring correlated feature sets.
DimHeatmap(seurat, dims = 1:3, cells = 500, balanced = TRUE)
Now that we have performed PCA, we can plot metadata and genes on the cell groups. Here we use the FeaturePlot function to plot on the PCA. We specify plotting on the principal components using reduction.use, specifically the first two PCA components (specified in dim.1 and dim.2).
FeaturePlot(seurat, reduction='pca', features=c("nFeature_RNA", "percent.mt"))
How would you plot the same plot using component 2 and component 3? If we wanted to save the Seurat object we could use the save function (and later load the object with load()). We are not going to do this now, but this is one of the ways to share analysis, by saving and sharing Seurat R objects.
# save(seurat, file = "nsclc.Rda")
PCA will create many components of variation but not all are useful. One way to select principal components that explain the variation in our data is to use an elbow or scree plot.
ElbowPlot(seurat)
From this plot we are going to select the first 20 principal components to use in the t-SNE visualization.
The JackStrawPlot function provides a visualization tool for comparing the distribution of p-values for each PC with a uniform distribution (dashed line). ‘Significant’ PCs will show a strong enrichment of features with low p-values (solid curve above the dashed line). This process takes a while, so instead of doing 100 replicates, we do 10 replicates.
#Uncoment below if you want to run a resampling test inspired by the JackStraw procedure. We randomly permute a subset of the data (1% by default) and rerun PCA, constructing a ‘null distribution’ of feature scores, and repeat this procedure. We identify ‘significant’ PCs as those who have a strong enrichment of low p-value features.
#seurat <- JackStraw(seurat, num.replicate = 100)
#seurat <- ScoreJackStraw(seurat, dims = 1:20)
#JackStrawPlot(seurat, dims = 1:15)
tSNE is a way to reduce data dimensionality and visualize data, but it is not a clustering approach. Please read How to use tSNE effectively (https://distill.pub/2016/misread-tsne/)
UMAP is a much faster visualization than tSNE and may preserve more of the global structure in your data. The preprint for the method is available at (https://arxiv.org/abs/1802.03426).
seurat <- RunTSNE(seurat,dims = 1:20) # runs Barnes-hut t-SNE
TSNEPlot(object = seurat) + NoLegend()
seurat <- RunUMAP(seurat, dims = 1:10)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 18:35:35 UMAP embedding parameters a = 0.9922 b = 1.112
## 18:35:35 Read 6122 rows and found 10 numeric columns
## 18:35:35 Using Annoy for neighbor search, n_neighbors = 30
## 18:35:35 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 18:35:39 Writing NN index file to temp file /lab/solexa_scratch/ibarrasa/tmpRstudio//Rtmpm9HJiZ/filec60d49e1ad3f
## 18:35:39 Searching Annoy index using 1 thread, search_k = 3000
## 18:35:43 Annoy recall = 100%
## 18:35:43 Commencing smooth kNN distance calibration using 1 thread
## 18:35:45 Initializing from normalized Laplacian + noise
## 18:35:46 Commencing optimization for 500 epochs, with 252610 positive edges
## 18:36:03 Optimization finished
DimPlot(seurat, reduction = "umap") + NoLegend()
We use a graph-based community detection algorithm to cluster cells into cell subsets. Seurat v3 applies a graph-based clustering approach, building upon initial strategies in Macosko et al (https://www.cell.com/fulltext/S0092-8674(15)00549-8). Importantly, the distance metric which drives the clustering analysis (based on previously identified PCs) remains the same. However, Seurat v3 approach to partioning the cellular distance matrix into clusters has dramatically improved. See https://satijalab.org/seurat/v3.1/pbmc3k_tutorial.html
## Find clusters
# save.SNN = T saves the SNN so that the clustering algorithm can be rerun using the same graph
# but with a different resolution value.
#Find clusters
seurat <- FindNeighbors(object = seurat, dims = 1:20)
## Computing nearest neighbor graph
## Computing SNN
seurat <- FindClusters(object = seurat, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 6122
## Number of edges: 227284
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9023
## Number of communities: 15
## Elapsed time: 1 seconds
# note that you can set label=T to help label individual clusters
TSNEPlot(object = seurat, label=T)+ NoLegend()
Here we compare how the clusters look when mapped onto either the tSNE or UMAP visualizations.
p1 <- TSNEPlot(object = seurat, label=T) + NoLegend()
p2 <- UMAPPlot(object = seurat, label=T) + NoLegend()
CombinePlots(plots = list(p1, p2))
## Warning: CombinePlots is being deprecated. Plots should now be combined using
## the patchwork system.
#save the plots to a pdf file
pdf("./TSNE_UMAPByCluster.pdf", w=11, h=8.5)
CombinePlots(plots = list(p1, p2))
## Warning: CombinePlots is being deprecated. Plots should now be combined using
## the patchwork system.
dev.off()
## png
## 2
pdf("./TSNE_Cluster.pdf", w=11, h=8.5)
p1
dev.off()
## png
## 2
We can now visualize how an individual gene’s expression changes across the different cell subsets. We use FeaturePlot in these visualizations. Plotting known marker genes, helps identifying some of the clusters.
# Plotting genes of interest
# Monocyte/Macrophage markers
FeaturePlot(object = seurat, features = c("CD14", "LYZ","FCGR3A"), cols = c("grey", "blue"), reduction = "tsne")
pdf("./TSNE_MonocyteMacroph.pdf", w=11, h=8.5)
FeaturePlot(object = seurat, features = c("CD14", "LYZ","FCGR3A"), cols = c("grey", "blue"), reduction = "tsne")
dev.off()
## png
## 2
pdf("./UMAP_MonocyteMacroph.pdf", w=11, h=8.5)
FeaturePlot(object = seurat, features = c("CD14", "LYZ","FCGR3A"), cols = c("grey", "blue"), reduction = "umap")
dev.off()
## png
## 2
# T cell markers
genes1 <- c("CD3D", "CD3E", "CD4","CD8A")
#treg markers
genes2 <- c( "FOXP3", "IL7R")
FeaturePlot(object = seurat, features = genes1, cols = c("grey", "blue"), reduction = "tsne")
FeaturePlot(object = seurat, features = genes2, cols = c("grey", "blue"), reduction = "tsne")
FeaturePlot(object = seurat, features = genes1, cols = c("grey", "blue"), reduction = "umap")
FeaturePlot(object = seurat, features = genes2, cols = c("grey", "blue"), reduction = "umap")
pdf("./T_cell_markers1.pdf", w=11, h=8.5)
FeaturePlot(object = seurat, features = genes1, cols = c("grey", "blue"), reduction = "tsne")
dev.off()
## png
## 2
pdf("./T_cell_markers2.pdf", w=11, h=8.5)
FeaturePlot(object = seurat, features = genes2, cols = c("grey", "blue"), reduction = "tsne")
dev.off()
## png
## 2
pdf("./T_cell_markers3.pdf", w=11, h=8.5)
FeaturePlot(object = seurat, features = genes1, cols = c("grey", "blue"), reduction = "umap")
dev.off()
## png
## 2
pdf("./T_cell_markers4.pdf", w=11, h=8.5)
FeaturePlot(object = seurat, features = genes2, cols = c("grey", "blue"), reduction = "umap")
dev.off()
## png
## 2
# NK cell markers
genes <- c("GZMA", "GZMB", "PRF1", "NKG7", "GNLY")
FeaturePlot(object = seurat, features = genes, cols = c("grey", "blue"), reduction = "tsne")
FeaturePlot(object = seurat, features = genes, cols = c("grey", "blue"), reduction = "umap")
pdf("./NK_cell_markers.pdf", w=11, h=8.5)
FeaturePlot(object = seurat, features = genes, cols = c("grey", "blue"), reduction = "tsne")
FeaturePlot(object = seurat, features = genes, cols = c("grey", "blue"), reduction = "umap")
dev.off()
## png
## 2
# B cell markers
genes <- genes <- c("CD19", "MS4A1")
FeaturePlot(object = seurat, features = genes, cols = c("grey", "blue"), reduction = "tsne")
FeaturePlot(object = seurat, features = genes, cols = c("grey", "blue"), reduction = "umap")
pdf("./B_cell_markers.pdf", w=11, h=8.5)
FeaturePlot(object = seurat, features = genes, cols = c("grey", "blue"), reduction = "tsne")
FeaturePlot(object = seurat, features = genes, cols = c("grey", "blue"), reduction = "umap")
dev.off()
## png
## 2
We want to label each cell subset according to what cell type it is. We can update the identity slot to these new identities.
#current.cluster.ids <- c(0:14)
new.cluster.ids <- c("B cells", "CD4+ T cells", "2", "CD8+ T cells", "Monocytes/Macrophages", "Monocytes/Macrophages", "NK cells", "Monocytes/Macrophages", "Treg cells" , "9", "10", "11", "12", "13", "14")
names(new.cluster.ids) <- levels(seurat)
seurat <- RenameIdents(seurat, new.cluster.ids)
DimPlot(seurat, reduction = "tsne", label = TRUE, pt.size = 0.1) + NoLegend()
pdf("./NewLabels.pdf", w=11, h=8.5)
DimPlot(seurat, reduction = "tsne", label = TRUE, pt.size = 0.1) + NoLegend()
dev.off()
## png
## 2
How did this work? We took the original cluster identities (originally by default numbered) and mapped them to new values using the mapvalues command.
In the above code, we have identified several clusters of immune cells. However we also expect to see cell clusters consisting of stromal cells and/or malignant cells in this NSCLC sample. For some of these cell subsets, we may not have pre-existing marker genes that suggest the cell type. To help identify these cells, we look for genes that are differentially expressed in these clusters relative to all other clusters.
MIN_LOGFOLD_CHANGE = 2 # set to minimum required average log fold change in gene expression.
MIN_PCT_CELLS_EXPR_GENE = .25 # minimum percent of cells that must express gene in either clstr.
# Here we find all markers for
all.markers = FindAllMarkers(seurat,
min.pct = MIN_PCT_CELLS_EXPR_GENE,
logfc.threshold = MIN_LOGFOLD_CHANGE,
only.pos = TRUE)
## Calculating cluster B cells
## Calculating cluster CD4+ T cells
## Calculating cluster 2
## Calculating cluster CD8+ T cells
## Calculating cluster Monocytes/Macrophages
## Calculating cluster NK cells
## Calculating cluster Treg cells
## Calculating cluster 9
## Calculating cluster 10
## Calculating cluster 11
## Calculating cluster 12
## Calculating cluster 13
## Calculating cluster 14
Here we found markers for all clusters using a DE test. There are other ways of selecting markers, feel free to read the original Seurat tutorial for more details. Here we use the Wilcoxon Rank Sum test, “wilcox”. There are many other options, including MAST.
Let’s look at the top markers for each cluster sorted by p-value.
# sort all the markers by p-value
all.markers.sortedByPval = all.markers[order(all.markers$p_val),]
# take a look at the top most significant markers
head(all.markers.sortedByPval)
## p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene
## MS4A1 0 2.399473 0.962 0.052 0 B cells MS4A1
## CD79A 0 2.102201 0.963 0.093 0 B cells CD79A
## KRT17 0 3.976961 0.930 0.107 0 2 KRT17
## S100A2 0 3.785467 0.913 0.083 0 2 S100A2
## LCN2 0 3.573257 0.741 0.020 0 2 LCN2
## SLPI 0 3.540244 0.833 0.038 0 2 SLPI
We can also make heat maps of the top markers for each cluster.
top10 <- all.markers.sortedByPval %>% group_by(cluster) %>% do(head(., n=10))
DoHeatmap(object = seurat, features = top10$gene)+ NoLegend() +
theme(axis.text.y = element_text(color = "black", size = 5))
pdf("./HeatMaptop10Markers.pdf", w=11, h=8.5)
DoHeatmap(object = seurat, features = top10$gene)+ NoLegend() +
theme(axis.text.y = element_text(color = "black", size = 5))
dev.off()
## png
## 2
Based on these differentially expressed genes, can you identify additional cell subsets?
You can save the object at this point so that it can easily be loaded back into an R session without having to rerun the computationally intensive steps performed above, or easily shared with collaborators.
save(seurat, all.markers, file = "~/nsclc.Rda")
# examine the top 4 markers in the context of the tSNE plots:
FeaturePlot(seurat, features = all.markers.sortedByPval$gene[1:4], reduction = "tsne")
In the tests above for differentially expressed genes, some differentially expressed genes may be common across clusters. For example, the 3 clusters of T cells all have IL7R (T cell development gene) as a differentially expressed gene. To find genes that are differentially expressed and unique to the cell cluster, we can use the below code.
# Find differentially expressed genes that are only differentially expressed in one cluster of cells.
genes_uniquely_DE = all.markers.sortedByPval %>% dplyr::filter(avg_logFC >= MIN_LOGFOLD_CHANGE) %>% group_by(gene) %>% summarize(n=n()) %>% dplyr::filter(n==1)
## `summarise()` ungrouping output (override with `.groups` argument)
# Sort DE genes by p value.
genes_uniquely_DE.markers.sortedByPval =
all.markers.sortedByPval[all.markers.sortedByPval$gene %in% genes_uniquely_DE$gene &
all.markers.sortedByPval$avg_logFC >= MIN_LOGFOLD_CHANGE,]
# Get top 3 unique genes for each cell cluster.
top_marker_each = genes_uniquely_DE.markers.sortedByPval %>% dplyr::group_by(cluster) %>% do(head(., n=3)) # number of top markers for each cluster.
print(top_marker_each)
## Registered S3 method overwritten by 'cli':
## method from
## print.boxx spatstat
## # A tibble: 30 x 7
## # Groups: cluster [11]
## p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
## 1 0 2.40 0.962 0.052 0 B cells MS4A1
## 2 0 2.10 0.963 0.093 0 B cells CD79A
## 3 0 3.98 0.93 0.107 0 2 KRT17
## 4 0 3.79 0.913 0.083 0 2 S100A2
## 5 0 3.57 0.741 0.02 0 2 LCN2
## 6 0 2.44 0.64 0.051 0 CD8+ T cells GZMK
## 7 0 4.26 0.943 0.037 0 Monocytes/Macrophages LYZ
## 8 0 2.89 0.97 0.093 0 Monocytes/Macrophages TYROBP
## 9 0 2.70 0.461 0.014 0 Monocytes/Macrophages C1QC
## 10 0 3.90 0.744 0.02 0 NK cells GNLY
## # … with 20 more rows
DoHeatmap(object = seurat, features = top_marker_each$gene)+ NoLegend() +
theme(axis.text.y = element_text(color = "black", size = 5))
pdf("./HeatMaptop3Markers.pdf", w=11, h=8.5)
DoHeatmap(object = seurat, features = top_marker_each$gene)+ NoLegend() +
theme(axis.text.y = element_text(color = "black", size = 5))
dev.off()
## png
## 2
Based on these differentially expressed genes, can you identify additional cell subsets?
We can look at marker gene expression using violin plots.
#uncomment the loop below if you want to make a plot for each of the marker genes.
#for (i in 1:length(top_marker_each$gene)) {
# print(VlnPlot(seurat, features = top_marker_each$gene[i]), pt.size = 0.1) +NoLegend()
#}
#look at the firt 5 marker genes:
for (i in 1:5) {
print(VlnPlot(seurat, features = top_marker_each$gene[i]), pt.size = 0.1) +NoLegend()
}
Dotplots are a concise way of exploring percent of cells expressing a gene and the gene expression intensity. This gives much of the same information a heatmap gives us without the resolution of looking at every cell. This can shield us from some of the noise caused by drop out and scales better with larger datasets.
DotPlot(seurat, features=unique(top_marker_each$gene), dot.scale = 6) + RotatedAxis() +theme(axis.text.x = element_text(color = "black", size = 7))
DotPlot(seurat, features=unique(top_marker_each$gene), dot.scale = 6) + RotatedAxis() +theme(axis.text.x = element_text(color = "black", size = 7)) + NoLegend()
pdf("./DotPlot.pdf", w=11, h=8.5)
DotPlot(seurat, features=unique(top_marker_each$gene), dot.scale = 6) + RotatedAxis() +theme(axis.text.x = element_text(color = "black", size = 7))
dev.off()
## png
## 2
Congratulations! You can identify and visualize cell subsets and the marker genes that describe these cell subsets. This is a very powerful analysis pattern often seen in publications. Well done!
# Save details of your session
print(sessionInfo(), locale=FALSE)
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.3.2 reshape_0.8.8 gdata_2.18.0 Matrix_1.2-18 dplyr_1.0.2
## [6] Seurat_3.2.2
##
## loaded via a namespace (and not attached):
## [1] Rtsne_0.15 colorspace_1.4-1 deldir_0.1-29
## [4] ellipsis_0.3.1 ggridges_0.5.2 spatstat.data_1.4-3
## [7] leiden_0.3.3 listenv_0.8.0 farver_2.0.3
## [10] ggrepel_0.8.2 RSpectra_0.16-0 fansi_0.4.1
## [13] codetools_0.2-16 splines_3.6.3 knitr_1.30
## [16] polyclip_1.10-0 jsonlite_1.7.1 ica_1.0-2
## [19] cluster_2.1.0 png_0.1-7 uwot_0.1.8
## [22] shiny_1.5.0 sctransform_0.3.1 compiler_3.6.3
## [25] httr_1.4.2 assertthat_0.2.1 fastmap_1.0.1
## [28] lazyeval_0.2.2 limma_3.42.2 cli_2.1.0
## [31] later_1.1.0.1 formatR_1.7 htmltools_0.5.0
## [34] tools_3.6.3 rsvd_1.0.3 igraph_1.2.6
## [37] gtable_0.3.0 glue_1.4.2 RANN_2.6.1
## [40] reshape2_1.4.4 rappdirs_0.3.1 Rcpp_1.0.5
## [43] spatstat_1.64-1 vctrs_0.3.4 nlme_3.1-149
## [46] lmtest_0.9-38 xfun_0.18 stringr_1.4.0
## [49] globals_0.13.1 mime_0.9 miniUI_0.1.1.1
## [52] lifecycle_0.2.0 irlba_2.3.3 gtools_3.8.2
## [55] goftest_1.2-2 future_1.19.1 MASS_7.3-53
## [58] zoo_1.8-8 scales_1.1.1 promises_1.1.1
## [61] spatstat.utils_1.17-0 parallel_3.6.3 RColorBrewer_1.1-2
## [64] yaml_2.2.1 reticulate_1.16 pbapply_1.4-3
## [67] gridExtra_2.3 rpart_4.1-15 stringi_1.5.3
## [70] rlang_0.4.8 pkgconfig_2.0.3 matrixStats_0.57.0
## [73] evaluate_0.14 lattice_0.20-41 ROCR_1.0-11
## [76] purrr_0.3.4 tensor_1.5 patchwork_1.0.1
## [79] htmlwidgets_1.5.2 labeling_0.3 cowplot_1.1.0
## [82] tidyselect_1.1.0 RcppAnnoy_0.0.16 plyr_1.8.6
## [85] magrittr_1.5 R6_2.4.1 generics_0.0.2
## [88] pillar_1.4.6 withr_2.3.0 mgcv_1.8-33
## [91] fitdistrplus_1.1-1 survival_3.2-7 abind_1.4-5
## [94] tibble_3.0.4 future.apply_1.6.0 crayon_1.3.4
## [97] KernSmooth_2.23-17 utf8_1.1.4 plotly_4.9.2.1
## [100] rmarkdown_2.4 grid_3.6.3 data.table_1.13.0
## [103] digest_0.6.25 xtable_1.8-4 tidyr_1.1.2
## [106] httpuv_1.5.4 munsell_0.5.0 viridisLite_0.3.0