Single-cell RNA-seq Demo (10X Non-Small Cell Lung Cancer)

Follow these steps:

Go to this address in your browser and login to https://rstudio.wi.mit.edu
Type the following commands:

# setwd("/nfs/BaRC_training/SingleCellRNA-seq")
# dir.create("userName")   #Use your userName (eg. jdoe)

Open the file with exercises and save it in your folder:

File -> Open File… -> “SingleCell_Seurat_2022.Rmd”

use “Save As…” to save the Rmd to your directory/folder #cd to your folder # setwd(“/nfs/BaRC_training/SingleCellRNA-seq/userName”)

Goal

To give you experience with the analysis of single cell RNA sequencing (scRNA-seq) including performing quality control and identifying cell type subsets.
To introduce you to scRNA-seq analysis using the Seurat package.

Introduction

Data produced in a single cell RNA-seq experiment has several interesting characteristics that make it distinct from data produced in a bulk population RNA-seq experiment. Two characteristics that are important to keep in mind when working with scRNA-Seq are the large amount of zeros due to limiting mRNA and the potential for quality control (QC) metrics to be confounded with biology. This combined with the ability to measure heterogeniety from cells in samples has shifted the field away from the typical analysis in population-based RNA-Seq. Here we demonstrate some approaches to quality control, followed by identifying and analyzing cell subsets.

For this tutorial, we will be analyzing the a dataset of Non-Small Cell Lung Cancer Cells (NSCLC) freely available from 10X Genomics (https://support.10xgenomics.com/single-cell-vdj/datasets/2.2.0/vdj_v1_hs_nsclc_5gex), using the Seurat R package (http://satijalab.org/seurat/), a popular and powerful set of tools to conduct scRNA-seq analysis in R. In this dataset, there are 7802 single cells that were sequenced on the Illumina NovaSeq 6000. Please note this tutorial borrows heavily from Seurat’s tutorials (https://satijalab.org/seurat/vignettes.html), so feel free to go through them in more detail.

Load necessary packages

When loading libraries, we are asking R to load code for us written by someone else. It is a convenient way to leverage and reproduce methodology developed by others.

library(Seurat)
library(dplyr)
library(Matrix)
library(gdata)
library(reshape)
library(ggplot2)
library(patchwork)
packageDescription("Seurat")$Version

## [1] "4.0.3"

Read in NSCLC counts matrix.

The data for Non-Small Cell Lung Cancer Cells (NSCLC) is freely available from 10X Genomics (https://support.10xgenomics.com/single-cell-vdj/datasets/2.2.0/vdj_v1_hs_nsclc_5gex).

We will use Seurat to analyze this scRNA-seq sample:

http://satijalab.org/seurat/

A widely used reagent for 3’ single cell transcriptomics sequencing is the 3 prime 10X assay. Sequencing data from 10X can be transformed to gene count matrices using the Cell Ranger pipeline. We start by reading in the counts matrix generated by the Cell Ranger count program.

counts_matrix_filename = "/nfs/BaRC_training/SingleCellRNA-seq/filtered_gene_bc_matrices/GRCh38/"
counts <- Read10X(data.dir = counts_matrix_filename)

Seurat works with the count matrix in the sparse format, which is far more memory efficient.

seurat <- CreateSeuratObject(counts, project = "10X_NSCLC")

## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')

Almost all our analysis will be on the single object, of class Seurat, created above. This object contains various “slots” that will store not only the raw count data, but also the results from various computations below. This has the advantage that we do not need to keep track of inidividual variables of interest - they can all be collapsed into a single object as long as these slots are pre-defined. This object was made with the CreateSeuratObject().
You could filter out genes that are are expressed in a low number of cels, i.e. 2 or fewer cells, and filter cells with low complexity, i.e. less than 350 genes, with a command like this: “seurat <- CreateSeuratObject(counts, min.cells = 3, min.features = 350, project =”10X_NSCLC“)”. However, for now we are keeping all the genes and cells. Keeping all the genes makes integration with other samples easier, since you want all the samples to have the same number of genes assayed.

The “GetAssayData” function allows us to get the counts/raw data stored on the “counts” slot. As we proceed on our analysis other slots like “data” (normalized data) and “scale.data” (data scaled by gene) will be populated.

GetAssayData(object = seurat, slot = "counts")[1:10,1:10]

## 10 x 10 sparse Matrix of class "dgCMatrix"

##    [[ suppressing 10 column names 'AAACCTGAGCTAGTCT-1', 'AAACCTGAGGGCACTA-1', 'AAACCTGAGTACGTTC-1' ... ]]

##                                  
## RP11-34P13.3  . . . . . . . . . .
## FAM138A       . . . . . . . . . .
## OR4F5         . . . . . . . . . .
## RP11-34P13.7  . . . . . . . . . .
## RP11-34P13.8  . . . . . . . . . .
## RP11-34P13.14 . . . . . . . . . .
## RP11-34P13.9  . . . . . . . . . .
## FO538757.3    . . . . . . . . . .
## FO538757.2    . . . . 1 2 . . . .
## AP006222.2    . . . . . . . . . .

Preprocessing step 1: Filter out low-quality cells

We would like to exclude cells that are damaged. A common metric for this is the proportion of counts derived from mitochondrial genes.

Here we calculate the percent of mitochondrial reads and add it to the Seurat object in the slot named meta.data, using the function “PercentageFeatureSet”.

# The number of genes and UMIs (nFeature_RNA and nCount_RNA) are automatically calculated for every object by Seurat.

# this is step adds the % mit content to the metadata
seurat[["percent.mt"]] <- PercentageFeatureSet(seurat, pattern = "^MT-")

head(seurat@meta.data)

##                    orig.ident nCount_RNA nFeature_RNA percent.mt
## AAACCTGAGCTAGTCT-1  10X_NSCLC       3605         1184  20.832178
## AAACCTGAGGGCACTA-1  10X_NSCLC       3828         1387   3.448276
## AAACCTGAGTACGTTC-1  10X_NSCLC       6457         1784   2.865108
## AAACCTGAGTCCGGTC-1  10X_NSCLC       3075         1092   6.276423
## AAACCTGCACCAGGTC-1  10X_NSCLC       9400         2626   3.202128
## AAACCTGCACCTCGTT-1  10X_NSCLC      48839         5603   2.321915

Next, we plot the number of genes, number of UMIs and the percent of mitochondrial reads per cell using the violin plot function provided by Seurat. Cells in this plot are group by what Seurat calls orig.ident. Identity is a concept that is used in the Seurat object to refer to the cell identity. In this case, the cell identity is 10X_NSCLC, but after we cluster the cells, the cell identity will be whatever cluster the cell belongs to. We will see how identity updates as we go through the analysis.

VlnPlot(seurat, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3, pt.size = 0.1) +NoLegend()

Next we use the function FeatureScatter to explore the number of genes, number of UMIs and the percent of mitochondrial reads per cell.

# FeatureScatter is typically used to visualize feature-feature relationships, but can be used
# for anything calculated by the object, i.e. columns in object metadata, PC scores etc.

plot1 <- FeatureScatter(seurat, feature1 = "nCount_RNA", feature2 = "percent.mt") + NoLegend()
plot2 <- FeatureScatter(seurat, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") + NoLegend()
#CombinePlots(plots = list(plot1, plot2))
plot1 + plot2

FeatureScatter is typically used to visualize gene-gene relationships, but can be used for anything calculated by the object, i.e. columns in object@meta.data, PC scores etc. Since there is a rare subset of cells with an outlier level of high mitochondrial percentage and also low UMI content, we will filter these cells. The title of these plots is the Pearson correlation between the two variables: percent.mito vs nCount_RNA or nFeature_RNA (number of genes) vs nCount_RNA.

Examine contents of Seurat object

These are the slots in the Seurat object. Some of the slots are automatically updated by Seurat as you move through analysis. Take a moment to look through the information, knowing the slots allow you to leverage work Seurat has already done for you.

str(seurat)

## Formal class 'Seurat' [package "SeuratObject"] with 13 slots
##   ..@ assays      :List of 1
##   .. ..$ RNA:Formal class 'Assay' [package "SeuratObject"] with 8 slots
##   .. .. .. ..@ counts       :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. ..@ i       : int [1:13871580] 45 61 153 156 160 197 216 248 254 264 ...
##   .. .. .. .. .. ..@ p       : int [1:7803] 0 1184 2571 4355 5447 8073 13676 15863 17506 18607 ...
##   .. .. .. .. .. ..@ Dim     : int [1:2] 33694 7802
##   .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. ..$ : chr [1:33694] "RP11-34P13.3" "FAM138A" "OR4F5" "RP11-34P13.7" ...
##   .. .. .. .. .. .. ..$ : chr [1:7802] "AAACCTGAGCTAGTCT-1" "AAACCTGAGGGCACTA-1" "AAACCTGAGTACGTTC-1" "AAACCTGAGTCCGGTC-1" ...
##   .. .. .. .. .. ..@ x       : num [1:13871580] 1 2 7 1 1 3 1 1 1 1 ...
##   .. .. .. .. .. ..@ factors : list()
##   .. .. .. ..@ data         :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. ..@ i       : int [1:13871580] 45 61 153 156 160 197 216 248 254 264 ...
##   .. .. .. .. .. ..@ p       : int [1:7803] 0 1184 2571 4355 5447 8073 13676 15863 17506 18607 ...
##   .. .. .. .. .. ..@ Dim     : int [1:2] 33694 7802
##   .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. ..$ : chr [1:33694] "RP11-34P13.3" "FAM138A" "OR4F5" "RP11-34P13.7" ...
##   .. .. .. .. .. .. ..$ : chr [1:7802] "AAACCTGAGCTAGTCT-1" "AAACCTGAGGGCACTA-1" "AAACCTGAGTACGTTC-1" "AAACCTGAGTCCGGTC-1" ...
##   .. .. .. .. .. ..@ x       : num [1:13871580] 1 2 7 1 1 3 1 1 1 1 ...
##   .. .. .. .. .. ..@ factors : list()
##   .. .. .. ..@ scale.data   : num[0 , 0 ] 
##   .. .. .. ..@ key          : chr "rna_"
##   .. .. .. ..@ assay.orig   : NULL
##   .. .. .. ..@ var.features : logi(0) 
##   .. .. .. ..@ meta.features:'data.frame':   33694 obs. of  0 variables
##   .. .. .. ..@ misc         : list()
##   ..@ meta.data   :'data.frame': 7802 obs. of  4 variables:
##   .. ..$ orig.ident  : Factor w/ 1 level "10X_NSCLC": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ nCount_RNA  : num [1:7802] 3605 3828 6457 3075 9400 ...
##   .. ..$ nFeature_RNA: int [1:7802] 1184 1387 1784 1092 2626 5603 2187 1643 1101 38 ...
##   .. ..$ percent.mt  : num [1:7802] 20.83 3.45 2.87 6.28 3.2 ...
##   ..@ active.assay: chr "RNA"
##   ..@ active.ident: Factor w/ 1 level "10X_NSCLC": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..- attr(*, "names")= chr [1:7802] "AAACCTGAGCTAGTCT-1" "AAACCTGAGGGCACTA-1" "AAACCTGAGTACGTTC-1" "AAACCTGAGTCCGGTC-1" ...
##   ..@ graphs      : list()
##   ..@ neighbors   : list()
##   ..@ reductions  : list()
##   ..@ images      : list()
##   ..@ project.name: chr "10X_NSCLC"
##   ..@ misc        : list()
##   ..@ version     :Classes 'package_version', 'numeric_version'  hidden list of 1
##   .. ..$ : int [1:3] 4 0 2
##   ..@ commands    : list()
##   ..@ tools       : list()

Next, let’s filter the cells based on the quality control metrics: Filtering based on the minimun number of genes detected per cell, and the percentage of counts mapping to mitochondrial genes is aim to remove low quality cells. The filter on the max number of genes detected per cell (nFeature_RNA < 5000) is aim to remove doublets. However, this is a very naive way to remove doublets. Please refer to the slides for software used to remove doublets. Additionally, if your sample contains a mix of cell types with very different total number of genes expressed, then using an upper bound on the total number of genes expressed may not be appropriate.

seurat <- subset(seurat, subset = nFeature_RNA > 350 & nFeature_RNA < 5000 & percent.mt < 10)

Visualize the number of genes, number of UMIs and the percent of mitochondrial reads per cell after filtering on number of genes and % counts mapping to mitochondrial genes.

VlnPlot(seurat, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3, pt.size = 0.1) +NoLegend()

Preprocessing step 2: Expression normalization

By default, we employ a global-scaling normalization method “LogNormalize” that normalizes the gene expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result. The log-transformation is a commonly used transformation that has many desirable properties, such as variance stabilization. Seurat uses natural log.

seurat <- NormalizeData(seurat, normalization.method = "LogNormalize", scale.factor = 10000)

Detection of variable genes across the single cells

Seurat calculates highly variable genes and focuses on these for downstream analysis. FindVariableFeatures, by default, it returns 2,000 features per dataset. More detailed can be found here: https://www.ncbi.nlm.nih.gov/pubmed/31178118

seurat <- FindVariableFeatures(seurat, selection.method = "vst", nfeatures = 2000)
length(VariableFeatures(seurat))

## [1] 2000

# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(seurat), 10)

# plot variable features with and without labels
plot1 <- VariableFeaturePlot(seurat)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)

## When using repel, set xnudge and ynudge to 0 for optimal results

#CombinePlots(plots = list(plot1, plot2))
plot1

## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Removed 11626 rows containing missing values (geom_point).

plot2

## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Removed 11626 rows containing missing values (geom_point).

Scaling the data

Next, we apply a linear transformation (‘scaling’) that is a standard pre-processing step prior to dimensional reduction techniques like PCA. The ScaleData function: - Shifts the expression of each gene, so that the mean expression across cells is 0 - Scales the expression of each gene, so that the variance across cells is 1 - This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate - The results of this are stored in seurat[[“RNA”]]@scale.data

all.genes <- rownames(seurat)
seurat <- ScaleData(seurat, features = all.genes)

## Centering and scaling data matrix

Perform linear dimensional reduction

seurat <- RunPCA(seurat, features = VariableFeatures(object = seurat))

## PC_ 1 
## Positive:  KRT7, TACSTD2, SLPI, FXYD3, CLDN7, S100P, MPZL2, PTGES, KRT19, SERINC2 
##     ELF3, DSP, C19orf33, KRT17, LCN2, DDR1, KLF5, GPRC5A, KRT18, MDK 
##     S100A14, S100A16, ID1, TRIM29, IFITM3, TMPRSS4, CLDN4, SPINT1, SERPINB3, PERP 
## Negative:  CD74, HLA-DPA1, HLA-DPB1, HLA-DQA1, HLA-DRB1, HLA-DRA, HERPUD1, IGHA1, TNF, PIM2 
##     ITM2A, CCL5, CD7, KIAA0125, PLAC8, GZMA, KLRB1, STAG3, AC079767.4, IGKC 
##     DERL3, CTSW, C12orf75, BATF, JCHAIN, GZMK, CST7, HOPX, HLA-DRB5, TRBC1 
## PC_ 2 
## Positive:  AIF1, LYZ, TYROBP, FCER1G, SERPINA1, FCGR2A, MS4A6A, CD14, CD68, CSF1R 
##     CPVL, FCGRT, FPR1, MAFB, SPI1, CLEC7A, CD163, RNF130, PILRA, C1orf162 
##     IGSF6, PLXDC2, CST3, SLC11A1, LILRB2, FTL, NPC2, LST1, MNDA, NCF2 
## Negative:  RPS18, PERP, AQP3, KRT7, KRT17, CD24, FXYD3, DSP, C19orf33, SLPI 
##     S100A2, KRT18, PTGES, KLF5, S100P, S100A14, LCN2, S100A16, CDKN2A, ELF3 
##     TRIM29, KRT19, DDR1, SERPINB3, TACSTD2, GPRC5A, KRT6A, RAB25, MDK, TFAP2A 
## PC_ 3 
## Positive:  TPSD1, CPA3, TPSAB1, TPSB2, MS4A2, GATA2, HPGDS, SLC18A2, KIT, HDC 
##     RP11-354E11.2, IL1RL1, CLU, VWA5A, MAOB, CNRIP1, RP11-32B5.7, HPGD, LTC4S, RGS13 
##     SLC45A3, GMPR, NTRK1, TMEM176B, GALC, ENPP3, SDPR, CALB2, TIMP3, CD9 
## Negative:  S100A8, S100A12, S100A9, SPRR1B, HLA-DRA, PDZK1IP1, NAMPT, CD74, SPRR2A, TMPRSS11D 
##     RDH10, ADGRF1, ERO1A, HLA-DRB1, FCN1, SCEL, CSTA, PRSS22, F3, APOBEC3A 
##     MUC4, HLA-DPA1, RAB31, RNF39, VCAN, TMPRSS4, BCL2A1, CLEC4E, HLA-DPB1, ELF3 
## PC_ 4 
## Positive:  DHRS9, CPA3, TPSD1, TPSAB1, TPSB2, SPRR1B, MS4A2, GATA2, HDC, SPRR2A 
##     HPGDS, SLC18A2, IL1RL1, PDZK1IP1, RDH10, RP11-354E11.2, KIT, ADGRF1, HPGD, CNRIP1 
##     SCEL, VWA5A, RNF39, MUC4, ERO1A, TMPRSS11D, PRSS22, RGS13, RP11-32B5.7, MAOB 
## Negative:  SFRP1, MSLN, GAS6, FBLN1, IGFBP4, PCSK1, IGFBP2, LINC00473, FOLR1, IGFBP7 
##     BCAM, SOD3, TNS4, TUSC3, CPE, RAMP1, ADIRF, ANXA8, RAET1G, HTRA1 
##     EPCAM, INHA, PON3, RP11-49I11.1, SMOC1, FEZ1, ARG2, NNMT, CAV2, ANXA8L1 
## PC_ 5 
## Positive:  HLA-DQA1, HLA-DQB2, HLA-DPB1, CD1E, HLA-DPA1, S100B, HLA-DRB1, HLA-DRB5, FCER1A, HLA-DRA 
##     PKIB, CD1A, PLD4, CD74, GSN, NDRG2, SERPINF1, CLEC10A, CD1C, APOL4 
##     KCNMB1, CD207, ENPP2, HLA-DQA2, AXL, PAK1, SERPINF2, CEACAM4, P2RY13, C1QB 
## Negative:  SLC11A1, VCAN, TREM1, MCEMP1, LINC01272, AQP9, PLIN2, SPP1, CD7, C5AR1 
##     SDC2, ANGPTL4, CXCL8, ANPEP, CTSL, MMP19, GZMA, UPP1, THBS1, PLAUR 
##     RGCC, NKG7, CCL5, KLRB1, CD300E, FCGR3A, GPNMB, PRF1, CTSW, SLC25A37

print(seurat[["pca"]], dims = 1:5, nfeatures = 5)

## PC_ 1 
## Positive:  KRT7, TACSTD2, SLPI, FXYD3, CLDN7 
## Negative:  CD74, HLA-DPA1, HLA-DPB1, HLA-DQA1, HLA-DRB1 
## PC_ 2 
## Positive:  AIF1, LYZ, TYROBP, FCER1G, SERPINA1 
## Negative:  RPS18, PERP, AQP3, KRT7, KRT17 
## PC_ 3 
## Positive:  TPSD1, CPA3, TPSAB1, TPSB2, MS4A2 
## Negative:  S100A8, S100A12, S100A9, SPRR1B, HLA-DRA 
## PC_ 4 
## Positive:  DHRS9, CPA3, TPSD1, TPSAB1, TPSB2 
## Negative:  SFRP1, MSLN, GAS6, FBLN1, IGFBP4 
## PC_ 5 
## Positive:  HLA-DQA1, HLA-DQB2, HLA-DPB1, CD1E, HLA-DPA1 
## Negative:  SLC11A1, VCAN, TREM1, MCEMP1, LINC01272

VizDimLoadings(seurat, dims = 1:2, reduction = "pca")

Visualize the PCA plot

DimPlot(seurat, reduction = "pca") + NoLegend()

In particular DimHeatmap allows for easy exploration of the primary sources of heterogeneity in a dataset, and can be useful when trying to decide which PCs to include for further downstream analyses. Both cells and features are ordered according to their PCA scores.

DimHeatmap(seurat, dims = 1:3, cells = 500, balanced = TRUE)

Now that we have performed PCA, we can plot metadata and genes on the cell groups. Here we use the FeaturePlot function to plot on the PCA. We specify plotting on the principal components using “reduction”, specifically the first two PCA components (using: dims = c(1, 2)).

FeaturePlot(seurat,  reduction='pca', features=c("nFeature_RNA",  "percent.mt"))

If we wanted to save the Seurat object we could use the save function (and later load the object with load()). We are not going to do this now, but this is one of the ways to share analysis, by saving and sharing Seurat R objects.

# save(seurat, file = "nsclc.Rda")

Determine the ‘dimensionality’ of the dataset

PCA will create many components of variation but not all are useful. One way to select principal components that explain the variation in our data is to use an elbow plot.

ElbowPlot(seurat)

From this plot we are going to select the first 20 principal components to use in the t-SNE visualization.

The JackStrawPlot function provides a visualization tool for comparing the distribution of p-values for each PC with a uniform distribution (dashed line). ‘Significant’ PCs will show a strong enrichment of features with low p-values (solid curve above the dashed line). This process takes a while, if you want to run it uncomment the code below.

#Uncoment below if you want to run a resampling test inspired by the JackStraw procedure. We randomly permute a subset of the data (1% by default) and rerun PCA, constructing a ‘null distribution’ of feature scores, and repeat this procedure. We identify ‘significant’ PCs as those who have a strong enrichment of low p-value features.
#seurat <- JackStraw(seurat, num.replicate = 100)
#seurat <- ScoreJackStraw(seurat, dims = 1:20)
#JackStrawPlot(seurat, dims = 1:15)

Visualization by tSNE and UMAP

tSNE is a way to reduce data dimensionality and visualize data, but it is not a clustering approach. Please read how to use tSNE effectively (https://distill.pub/2016/misread-tsne/)

UMAP is a much faster visualization than tSNE and may preserve more of the global structure in your data. The preprint for the method is available at (https://arxiv.org/abs/1802.03426).

seurat <- RunTSNE(seurat,dims = 1:20) 
TSNEPlot(object = seurat)  + NoLegend()

seurat <- RunUMAP(seurat, dims = 1:10)

## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session

## 13:08:04 UMAP embedding parameters a = 0.9922 b = 1.112

## 13:08:04 Read 6123 rows and found 10 numeric columns

## 13:08:04 Using Annoy for neighbor search, n_neighbors = 30

## 13:08:04 Building Annoy index with metric = cosine, n_trees = 50

## 0%   10   20   30   40   50   60   70   80   90   100%

## [----|----|----|----|----|----|----|----|----|----|

## **************************************************|
## 13:08:05 Writing NN index file to temp file /lab/solexa_scratch/ibarrasa/tmpRstudio//RtmphsMaGZ/file14242d7833f455
## 13:08:05 Searching Annoy index using 1 thread, search_k = 3000
## 13:08:06 Annoy recall = 100%
## 13:08:07 Commencing smooth kNN distance calibration using 1 thread
## 13:08:08 Initializing from normalized Laplacian + noise
## 13:08:08 Commencing optimization for 500 epochs, with 252566 positive edges
## 13:08:17 Optimization finished

DimPlot(seurat, reduction = "umap")  + NoLegend()

Cluster the cells

We use a graph-based community detection algorithm to cluster cells into cell subsets. Seurat applies a graph-based clustering approach, building upon initial strategies in Macosko et al (https://www.cell.com/fulltext/S0092-8674(15)00549-8). Importantly, the distance metric which drives the clustering analysis (based on previously identified PCs) remains the same. However, Seurat approach to partitioning the cellular distance matrix into clusters has dramatically improved. See https://satijalab.org/seurat/v3.1/pbmc3k_tutorial.html

Changing the “resolution” parameters will affect the final number of clusters found The resolution value used depends on how coarse or fine you would like the cell types/states characterization to be.

## Find clusters
# save.SNN = T saves the SNN so that the clustering algorithm can be rerun using the same graph
# but with a different resolution value.
#Find clusters 
seurat <- FindNeighbors(object = seurat, dims = 1:20)

## Computing nearest neighbor graph

## Computing SNN

seurat <- FindClusters(object = seurat, resolution = 0.5)

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 6123
## Number of edges: 224228
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9031
## Number of communities: 15
## Elapsed time: 0 seconds

# note that you can set label=T to help label individual clusters
TSNEPlot(object = seurat, label=T)+ NoLegend()

Compare tSNE and UMAP visualizations

Here we compare how the clusters look when mapped onto either the tSNE or UMAP visualizations.

p1 <- TSNEPlot(object = seurat, label=T) + NoLegend()
p2 <- UMAPPlot(object = seurat, label=T) + NoLegend()

CombinePlots(plots = list(p1, p2))

## Warning: CombinePlots is being deprecated. Plots should now be combined using
## the patchwork system.

#save the plots to a pdf file
pdf("./TSNE_UMAPByCluster.pdf", w=11, h=8.5)
#CombinePlots(plots = list(p1, p2))
p1 + p2
dev.off()

## png 
##   2

pdf("./TSNE_Cluster.pdf", w=11, h=8.5)
p1
dev.off()

## png 
##   2

Mapping gene expression onto cell clusters and assigning identity to clusters

We can now visualize how an individual gene’s expression changes across the different cell subsets. We use FeaturePlot in these visualizations. Plotting known marker genes, helps identifying some of the clusters.

Monocyte/Macrophage markers

FeaturePlot(object = seurat, features = c("CD14", "LYZ","FCGR3A"), cols = c("grey", "blue"), reduction = "tsne")

pdf("./TSNE_MonocyteMacroph.pdf", w=11, h=8.5)
FeaturePlot(object = seurat, features = c("CD14", "LYZ","FCGR3A"), cols = c("grey", "blue"), reduction = "tsne")
dev.off()

## png 
##   2

pdf("./UMAP_MonocyteMacroph.pdf", w=11, h=8.5)
FeaturePlot(object = seurat, features = c("CD14", "LYZ","FCGR3A"), cols = c("grey", "blue"), reduction = "umap")
dev.off()

## png 
##   2

T cell markers and treg markers

# T cell markers 
genes1 <- c("CD3D", "CD3E", "IL7R", "CD4","CD8A")
FeaturePlot(object = seurat, features = genes1[1:3], cols = c("grey", "blue"), reduction = "tsne")

FeaturePlot(object = seurat, features = genes1[4:5], cols = c("grey", "blue"), reduction = "tsne")

FeaturePlot(object = seurat, features = genes1, cols = c("grey", "blue"), reduction = "umap")

#treg markers
genes2 <- c( "FOXP3", "CCR8")
FeaturePlot(object = seurat, features = genes2, cols = c("grey", "blue"), reduction = "tsne")

FeaturePlot(object = seurat, features = genes2, cols = c("grey", "blue"), reduction = "umap")

pdf("./T_cell_markers_tSNE.pdf", w=11, h=8.5)
FeaturePlot(object = seurat, features = genes1, cols = c("grey", "blue"), reduction = "tsne")
dev.off()

## png 
##   2

pdf("./treg_markers_tSNE.pdf", w=11, h=8.5)
FeaturePlot(object = seurat, features = genes2, cols = c("grey", "blue"), reduction = "tsne")
dev.off()

## png 
##   2

pdf("./T_cell_markers_UMAP.pdf", w=11, h=8.5)
FeaturePlot(object = seurat, features = genes1, cols = c("grey", "blue"), reduction = "umap")
dev.off()

## png 
##   2

pdf("./treg_markers_UMAP.pdf", w=11, h=8.5)
FeaturePlot(object = seurat, features = genes2, cols = c("grey", "blue"), reduction = "umap")
dev.off()

## png 
##   2

NK cell markers

# NK cell markers
genes <- c("GZMB", "GZMH", "PRF1", "NKG7", "GNLY")
FeaturePlot(object = seurat, features = genes[1:2], cols = c("grey", "blue"), reduction = "tsne")

FeaturePlot(object = seurat, features = genes[3:4], cols = c("grey", "blue"), reduction = "tsne")

FeaturePlot(object = seurat, features = genes[5], cols = c("grey", "blue"), reduction = "tsne")

FeaturePlot(object = seurat, features = genes[1:3], cols = c("grey", "blue"), reduction = "umap")

FeaturePlot(object = seurat, features = genes[4:5], cols = c("grey", "blue"), reduction = "umap")

pdf("./NK_cell_markers.pdf", w=11, h=8.5)
FeaturePlot(object = seurat, features = genes, cols = c("grey", "blue"), reduction = "tsne")
FeaturePlot(object = seurat, features = genes, cols = c("grey", "blue"), reduction = "umap")
dev.off()

## png 
##   2

B cell markers

# B cell markers
genes <- genes <- c("CD19", "MS4A1", "CD79A")
FeaturePlot(object = seurat, features = genes, cols = c("grey", "blue"), reduction = "tsne")

FeaturePlot(object = seurat, features = genes, cols = c("grey", "blue"), reduction = "umap")

pdf("./B_cell_markers.pdf", w=11, h=8.5)
FeaturePlot(object = seurat, features = genes, cols = c("grey", "blue"), reduction = "tsne")
FeaturePlot(object = seurat, features = genes, cols = c("grey", "blue"), reduction = "umap")
dev.off()

## png 
##   2

Labeling cell subsets

We want to label each cell subset according to what cell type it is. We can update the identity slot to these new identities.

#current.cluster.ids <- c(0:14)
new.cluster.ids <- c("B cells", "CD4+ T cells", "2", "CD8+ T cells", "Monocytes/Macrophages", "Monocytes/Macrophages", "NK cells", "Monocytes/Macrophages", "Treg cells" , "9", "10", "11", "12", "13", "14") 

names(new.cluster.ids) <- levels(seurat)
seurat <- RenameIdents(seurat, new.cluster.ids)
DimPlot(seurat, reduction = "tsne", label = TRUE, pt.size = 0.1) + NoLegend()

pdf("./NewLabels.pdf", w=11, h=8.5)
DimPlot(seurat, reduction = "tsne", label = TRUE, pt.size = 0.1) + NoLegend()
dev.off()

## png 
##   2

How did this work? We took the original cluster identities (originally by default numbered) and mapped them to new values using the mapvalues command.

Using differentially expressed genes to help identify cell subsets

In the code above, we have identified several clusters of immune cells. However we also expect to see cell clusters consisting of stromal cells and/or malignant cells in this NSCLC sample. For some of these cell subsets, we may not have pre-existing marker genes that suggest the cell type. To help identify these cells, we look for genes that are differentially expressed in these clusters relative to all other clusters.

MIN_LOG2FOLD_CHANGE = 0.6 # set to minimum required average log fold change in gene expression.
MIN_PCT_CELLS_EXPR_GENE = .25  # minimum percent of cells that must express gene in either clstr.
#these values above are a little more stringent than the defaults
#you can set them to defaults, logfc.threshold = 0.25, min.pct = 0.1,  

# Here we find all markers for 
all.markers = FindAllMarkers(seurat,
                             min.pct = MIN_PCT_CELLS_EXPR_GENE,
                             logfc.threshold = MIN_LOG2FOLD_CHANGE,
                             only.pos = TRUE)

## Calculating cluster B cells

## Calculating cluster CD4+ T cells

## Calculating cluster 2

## Calculating cluster CD8+ T cells

## Calculating cluster Monocytes/Macrophages

## Calculating cluster NK cells

## Calculating cluster Treg cells

## Calculating cluster 9

## Calculating cluster 10

## Calculating cluster 11

## Calculating cluster 12

## Calculating cluster 13

## Calculating cluster 14

Here we found markers for all clusters using a DE test. There are other ways of selecting markers, feel free to read the original Seurat tutorial for more details. Here we use the Wilcoxon Rank Sum test, “wilcox”. There are many other options, including MAST.

Let’s look at the top markers for each cluster sorted by p-value.

# sort all the markers by p-value
all.markers.sortedByPval = all.markers[order(all.markers$p_val),]

#take a look at the top most significant markers
head(all.markers.sortedByPval)

##           p_val avg_log2FC pct.1 pct.2 p_val_adj cluster      gene
## MS4A1         0   3.473006 0.962 0.051         0 B cells     MS4A1
## CD79A         0   3.031872 0.963 0.093         0 B cells     CD79A
## BANK1         0   2.401089 0.839 0.059         0 B cells     BANK1
## CD19          0   1.936760 0.671 0.042         0 B cells      CD19
## LINC00926     0   1.873015 0.635 0.029         0 B cells LINC00926
## CD79B         0   1.844204 0.711 0.114         0 B cells     CD79B

#View(all.markers.sortedByPval)

Make a heatmap showing the top 10 marker genes for each cluster

We can also make heat maps of the top markers for each cluster.

top10 <- all.markers.sortedByPval %>%  group_by(cluster)  %>% do(head(., n=10))
DoHeatmap(object = seurat, features = top10$gene)+ NoLegend() +
    theme(axis.text.y = element_text(color = "black", size = 5))

pdf("./HeatMaptop10Markers.pdf", w=11, h=8.5)
DoHeatmap(object = seurat, features = top10$gene)+ NoLegend() +
    theme(axis.text.y = element_text(color = "black", size = 5))
dev.off()

## png 
##   2

Based on these differentially expressed genes, can you identify additional cell subsets?

You can save the object at this point so that it can easily be loaded back into an R session without having to rerun the computationally intensive steps performed above, or easily shared with collaborators.

save(seurat, all.markers, file = "~/nsclc.Rda")

# examine the top 4 markers in the context of the tSNE plots:
FeaturePlot(seurat, features = all.markers.sortedByPval$gene[1:4], reduction = "tsne")

Get genes uniquely differentially expressed in each cluster.

In the tests above for differentially expressed genes, some differentially expressed genes may be common across clusters. For example, the 3 clusters of T cells all have IL7R (T cell development gene) as a differentially expressed gene. To find genes that are differentially expressed and unique to the cell cluster, we can use the below code.

# Find differentially expressed genes that are only differentially expressed in one cluster of cells.
genes_uniquely_DE = all.markers.sortedByPval %>% dplyr::filter(avg_log2FC >= MIN_LOG2FOLD_CHANGE) %>% group_by(gene) %>%  summarize(n=n()) %>%  dplyr::filter(n==1)

# Sort DE genes by p value.
genes_uniquely_DE.markers.sortedByPval =
  all.markers.sortedByPval[all.markers.sortedByPval$gene %in% genes_uniquely_DE$gene & 
                             all.markers.sortedByPval$avg_log2FC >= MIN_LOG2FOLD_CHANGE,]
#View(genes_uniquely_DE.markers.sortedByPval)

# Get top 3 unique genes for each cell cluster.
top_marker_each = genes_uniquely_DE.markers.sortedByPval %>% dplyr::group_by(cluster) %>% do(head(., n=3))  # number of top markers for each cluster.

print(top_marker_each)

## Registered S3 method overwritten by 'cli':
##   method     from         
##   print.boxx spatstat.geom

## # A tibble: 36 x 7
## # Groups:   cluster [12]
##        p_val avg_log2FC pct.1 pct.2 p_val_adj cluster      gene  
##        <dbl>      <dbl> <dbl> <dbl>     <dbl> <fct>        <chr> 
##  1 0              3.47  0.962 0.051 0         B cells      MS4A1 
##  2 0              3.03  0.963 0.093 0         B cells      CD79A 
##  3 0              2.40  0.839 0.059 0         B cells      BANK1 
##  4 0              1.49  0.438 0.029 0         CD4+ T cells CD40LG
##  5 4.40e-186      1.13  0.361 0.056 1.48e-181 CD4+ T cells TCF7  
##  6 9.62e-181      0.662 0.996 0.995 3.24e-176 CD4+ T cells TPT1  
##  7 0              5.74  0.932 0.107 0         2            KRT17 
##  8 0              5.46  0.913 0.083 0         2            S100A2
##  9 0              5.16  0.741 0.02  0         2            LCN2  
## 10 6.76e- 77      0.933 0.251 0.046 2.28e- 72 CD8+ T cells NELL2 
## # … with 26 more rows

DoHeatmap(object = seurat, features = top_marker_each$gene)+ NoLegend() +
    theme(axis.text.y = element_text(color = "black", size = 5))

pdf("./HeatMaptop3Markers.pdf", w=11, h=8.5)
DoHeatmap(object = seurat, features = top_marker_each$gene)+ NoLegend() +
    theme(axis.text.y = element_text(color = "black", size = 5))
dev.off()

## png 
##   2

Based on these differentially expressed genes, can you identify additional cell subsets?

Visualizing marker genes: violin plots

We can look at marker gene expression using violin plots.

#uncomment the loop below if you want to make a plot for each of the marker genes.
#for (i in 1:length(top_marker_each$gene)) {
#  print(VlnPlot(seurat, features = top_marker_each$gene[i]), pt.size = 0.1) +NoLegend()
#}
#look at the firt 5 marker genes:
for (i in 1:5) {
  print(VlnPlot(seurat, features = top_marker_each$gene[i]), pt.size = 0.1) +NoLegend()
}

Other ways to visualize marker genes: dot plots

Dotplots are a concise way of exploring percent of cells expressing a gene and the gene expression intensity. This gives much of the same information a heatmap gives us without the resolution of looking at every cell. This can shield us from some of the noise caused by drop out and scales better with larger datasets.

DotPlot(seurat, features=unique(top_marker_each$gene),  dot.scale = 6)  + RotatedAxis() +theme(axis.text.x = element_text(color = "black", size = 7))

DotPlot(seurat, features=unique(top_marker_each$gene),  dot.scale = 6)  + RotatedAxis() +theme(axis.text.x = element_text(color = "black", size = 7)) + NoLegend()

pdf("./DotPlot.pdf", w=11, h=8.5)
DotPlot(seurat, features=unique(top_marker_each$gene),  dot.scale = 6)  + RotatedAxis() +theme(axis.text.x = element_text(color = "black", size = 7))
dev.off()

## png 
##   2

If you want other color scheme for your dotplots, i.e. blue-white-red with a black circle around theme, get in touch with us.

Print session info

# Save details of your session
print(sessionInfo(), locale=FALSE)

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] patchwork_1.1.1    ggplot2_3.3.5      reshape_0.8.8      gdata_2.18.0      
## [5] Matrix_1.3-4       dplyr_1.0.7        SeuratObject_4.0.2 Seurat_4.0.3      
## 
## loaded via a namespace (and not attached):
##   [1] Rtsne_0.15            colorspace_2.0-2      deldir_0.2-10        
##   [4] ellipsis_0.3.2        ggridges_0.5.3        rstudioapi_0.13      
##   [7] spatstat.data_2.1-0   farver_2.1.0          leiden_0.3.8         
##  [10] listenv_0.8.0         ggrepel_0.9.1         fansi_0.5.0          
##  [13] codetools_0.2-18      splines_4.1.0         knitr_1.33           
##  [16] polyclip_1.10-0       jsonlite_1.7.2        ica_1.0-2            
##  [19] cluster_2.1.2         png_0.1-7             uwot_0.1.10          
##  [22] shiny_1.6.0           sctransform_0.3.2     spatstat.sparse_2.0-0
##  [25] compiler_4.1.0        httr_1.4.2            assertthat_0.2.1     
##  [28] fastmap_1.1.0         lazyeval_0.2.2        cli_3.0.0            
##  [31] later_1.2.0           htmltools_0.5.1.1     tools_4.1.0          
##  [34] igraph_1.2.6          gtable_0.3.0          glue_1.4.2           
##  [37] RANN_2.6.1            reshape2_1.4.4        Rcpp_1.0.6           
##  [40] scattermore_0.7       jquerylib_0.1.4       vctrs_0.3.8          
##  [43] nlme_3.1-152          lmtest_0.9-38         xfun_0.24            
##  [46] stringr_1.4.0         globals_0.14.0        mime_0.11            
##  [49] miniUI_0.1.1.1        lifecycle_1.0.0       irlba_2.3.3          
##  [52] gtools_3.9.2          goftest_1.2-2         future_1.21.0        
##  [55] MASS_7.3-54           zoo_1.8-9             scales_1.1.1         
##  [58] spatstat.core_2.2-0   promises_1.2.0.1      spatstat.utils_2.2-0 
##  [61] parallel_4.1.0        RColorBrewer_1.1-2    yaml_2.2.1           
##  [64] reticulate_1.20       pbapply_1.4-3         gridExtra_2.3        
##  [67] sass_0.4.0            rpart_4.1-15          stringi_1.6.2        
##  [70] highr_0.9             rlang_0.4.11          pkgconfig_2.0.3      
##  [73] matrixStats_0.59.0    evaluate_0.14         lattice_0.20-44      
##  [76] ROCR_1.0-11           purrr_0.3.4           tensor_1.5           
##  [79] labeling_0.4.2        htmlwidgets_1.5.3     cowplot_1.1.1        
##  [82] tidyselect_1.1.1      parallelly_1.26.1     RcppAnnoy_0.0.18     
##  [85] plyr_1.8.6            magrittr_2.0.1        R6_2.5.0             
##  [88] generics_0.1.0        DBI_1.1.1             withr_2.4.2          
##  [91] pillar_1.6.1          mgcv_1.8-36           fitdistrplus_1.1-5   
##  [94] survival_3.2-11       abind_1.4-5           tibble_3.1.2         
##  [97] future.apply_1.7.0    crayon_1.4.1          KernSmooth_2.23-20   
## [100] utf8_1.2.1            spatstat.geom_2.2-0   plotly_4.9.4.1       
## [103] rmarkdown_2.9         grid_4.1.0            data.table_1.14.0    
## [106] digest_0.6.27         xtable_1.8-4          tidyr_1.1.3          
## [109] httpuv_1.6.1          munsell_0.5.0         viridisLite_0.4.0    
## [112] bslib_0.2.5.1