单细胞学习小组003期 Day5

原创

用户11153857

发布于 2024-07-03 10:17:03

890

发布于 2024-07-03 10:17:03

文章被收录于专栏：花花单细胞学习小组003

This is the homework of Huahua's scRNA-seq study group003 Day5.

The content of today is the processing of single-cell RNA-seq data.

Device name HD72201200

Full device name HD72201200

Processor 12th Gen Intel(R) Core(TM) i5-12500T 2.00 GHz

Installed RAM 16.0 GB (15.7 GB usable)

Device ID E913FFA8-45A6-4713-9BA1-9A62107A2006

Product ID 00329-00000-00003-AA309

System type 64-bit operating system, x64-based processor

Pen and touch No pen or touch input is available for this display

1. Download and organize data

Demo data: GSE231920

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE231920

This data set is prety fresh, which was uploaded in May of 2014 by a research group in Shanghai.

In the above screen snapshot, "custom" to show all seperated supplementary files.

Pick up the first sample, named 7306054_sample1, in 3 gz files including barcodes, features, and matrix. Then download them to local hard disk.

1.1 Unpacking Files

Downloaded file is a tar file. Copy it to working directory.

2 options for unpacking the tar file, using software in windows or Mac OS, or by R.

untar("GSE231920_RAW.tar",exdir = "input") # unzip the file "", exdir(解压为)= “文件夹名”
dir("input")   # dir等于list.file, 相当于ls，列出work directory里的所有文件

1.2 Requirements for single-cell file organisation

Three 10X files are kept in one folder. File names are fixed without prefix:

“barcodes.tsv.gz” for barcodes in DNA library index, each barcode for each single cell theoretically, will be the column name of the expression matrix.

“features.tsv.gz” for gene names, will be the row names of the expression matrix, could be ”genes.tsv.gz”.

“matrix.mtx.gz” for the value, only save non-zero value (save volume for the file).

If there are multiple samples, one folder is for one sample. The sample name can only be the folder name since the three names in each folder are the same.

1.3 Changing the file name

The data downloaded from GEO is named by sample ID as prefix, so their name need to be changed to standard ones mentioned above.

Renaming mannually or by R codes:

library(stringr)
nn = str_remove(dir("input/"),"GSM7306054_sample1_")
nn

file.rename(paste0("input/",dir("input/")),
            paste0("input/",nn))

dir("input/")

Q: The size of file is 0B. Is it correct?

2. Read and create Seurat objects

2.1 Reading a file

Function for reading data is Read10X.

rm(list = ls())  # 清空右上角的所有变量，方便反复调试代码
library(Seurat)
library(patchwork)
library(tidyverse)
ct = Read10X("input/")
dim(ct)

dim for colunm number and row number.

2.2 R supplementary knowledge. what is a sparse matrix?

A sparse matrix is a special case of a matrix in which the number of zero elements is much higher than the number of non-zero elements. As a rule of thumb, if 2/3 of the total elements in a matrix are zeros, it can be called a sparse matrix.

class(ct)
ct[c("CD3D","TCL1A","MS4A1"), 1:30]

2.3 Creating Seurat objects

Create one Seurat object, and all the following operation is on a Seurat object.

seu.obj <- CreateSeuratObject(counts = ct, 
                           min.cells = 3, # A gene must be expressed in at least three cells to be preserved
                           min.features = 200)  #At least 200 genes have to be expressed in a cell to be preserved
dim(seu.obj)  #the dim has changed and the number of rows and columns has decreased

2.4 Cell sampling

set.seed(1234)  # a random seed designed to make subsequent random samples repeatable
seu.obj = subset(seu.obj,downsample = 3000) #, as long as the number in setseed brackets does not change over multiple runs, the sampling result will not change

2.5 R language supplements knowledge ☞ objects

class(seu.obj)

The output means : this is a Seurat object, original of SeuratObject.

Compare the following two results:

Important tips:

要提取它的子集，有@和$两个符号，或者是从右边点开看，或者是用代码查看。一般第一层次都是@，后面的就不一定了

2.6 Explore meta information of Seurat objects

Two options:

seu.obj@meta.data %>% head()

seu.obj[[]] %>% head()

2.7 R language supplementary knowledge Pipeline symbols: %>%

%>%是向后传递，即：把%>%前面的数据传递给%>%后面的函数，作为该函数的第一个参数。

3. Quality control

3.1 Quality control indicators and reasons

Control numbers of mitochondrial genes,

and nFeature_RNA.

Explaination: nFeature_RNA是每个细胞中检测到的基因数量。nCount_RNA是细胞内检测到的分子总数。nFeature_RNA过低，表示该细胞可能已死/将死或是空液滴。太高的nCount_RNA和/或nFeature_RNA表明“细胞”实际上可以是两个或多个细胞。结合线粒体基因count数除去异常值，即可除去大多数双峰/死细胞/空液滴，因此它们过滤是常见的预处理步骤。

https://www.biostars.org/p/407036/

3.2 Calculation of mitochondrial gene ratio

pattern = “^MT-”, using regular expression to match MT- prefix genes

seu.obj[["percent.mt"]] <- PercentageFeatureSet(seu.obj, pattern = "^MT-")
head(seu.obj@meta.data)

A new colunm of "percent.mt" was added to meta.data

VlnPlot(seu.obj, 
        features = c("nFeature_RNA", 
                     "nCount_RNA", 
                     "percent.mt"), 
        ncol = 3,pt.size = 0.5)

Example of filteration criteria:

seu.obj = subset(seu.obj,
                nFeature_RNA < 6000 &
                nCount_RNA < 30000 &
                percent.mt < 18)

4. Dimensionality reduction clustering

4.1 Understanding dimension reduction

Expalination: copied from huahua's text. 对于表达矩阵来说，一个基因就是一个维度，有几万个维度，太多了，需要减少，称之为降维。

PCA是主成分分析，完成线性的降维，会把几万个维度转换为几十个新的维度，而UMAP和tSNE是非线性的降维，线性降维不够彻底，整点高级的，进一步降维。

UMAP 和tSNE二选一就行，没有必须用哪个的说法，只能说UMAP用的多一些，图一般会紧凑一些。

4.2 R Language Knowledge supplement : If present, skip

file.exists #判断一个文件在工作目录下是否存在的函数

f = "obj.Rdata"
if(!file.exists(f)){
  seu.obj = seu.obj %>% 
  NormalizeData() %>%  
  FindVariableFeatures() %>%  
  ScaleData(features = rownames(.)) %>%  
  RunPCA(features = VariableFeatures(.))  %>%
  FindNeighbors(dims = 1:15) %>% 
  FindClusters(resolution = 0.5) %>% 
  RunUMAP(dims = 1:15) %>% 
  RunTSNE(dims = 1:15)
  save(seu.obj,file = f)
}
load(f)
ElbowPlot(seu.obj)

p1 <- DimPlot(seu.obj, reduction = "umap",label = T)+NoLegend();p1

5. Homework for today, choose another dataset (GSM7306055_sample2), run the whole processe and get all results.

untar("GSE231920_RAW(1).tar",exdir = "input2")

dir("input2")

library(stringr)
nn = str_remove(dir("input2/"),"GSM7306055_sample2_")
nn

file.rename(paste0("input2/",dir("input2/")),
            paste0("input2/",nn))
dir("input2/")

ct = Read10X("input2/")
dim(ct)

class(ct)

seu.obj <- CreateSeuratObject(counts = ct, 
                              min.cells = 3, 
                              min.features = 200)
dim(seu.obj)

seu.obj@meta.data %>% head()   # seu.obj[[]] %>% head()

seu.obj[["percent.mt"]] <- PercentageFeatureSet(seu.obj, pattern = "^MT-")
head(seu.obj@meta.data)


VlnPlot(seu.obj, 
        features = c("nFeature_RNA",
                     "nCount_RNA", 
                     "percent.mt"), 
        ncol = 3,pt.size = 0.5)


f = "obj.Rdata"
if(!file.exists(f)){
  seu.obj = seu.obj %>% 
    NormalizeData() %>%  
    FindVariableFeatures() %>%  
    ScaleData(features = rownames(.)) %>%  
    RunPCA(features = VariableFeatures(.))  %>%
    FindNeighbors(dims = 1:15) %>% 
    FindClusters(resolution = 0.5) %>% 
    RunUMAP(dims = 1:15) %>% 
    RunTSNE(dims = 1:15)
  save(seu.obj,file = f)
}

load(f)
ElbowPlot(seu.obj)

p1 <- DimPlot(seu.obj, reduction = "umap",label = T)+NoLegend();p1

原创声明：本文系作者授权腾讯云开发者社区发表，未经许可，不得转载。

如有侵权，请联系 cloudcommunity@tencent.com 删除。

学习笔记

原创声明：本文系作者授权腾讯云开发者社区发表，未经许可，不得转载。

如有侵权，请联系 cloudcommunity@tencent.com 删除。

学习笔记

登录后参与评论

0 条评论

热度