This is the homework of Huahua's scRNA-seq study group003 Day5.
The content of today is the processing of single-cell RNA-seq data.
Device name HD72201200
Full device name HD72201200
Processor 12th Gen Intel(R) Core(TM) i5-12500T 2.00 GHz
Installed RAM 16.0 GB (15.7 GB usable)
Device ID E913FFA8-45A6-4713-9BA1-9A62107A2006
Product ID 00329-00000-00003-AA309
System type 64-bit operating system, x64-based processor
Pen and touch No pen or touch input is available for this display
Demo data: GSE231920
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE231920
This data set is prety fresh, which was uploaded in May of 2014 by a research group in Shanghai.
In the above screen snapshot, "custom" to show all seperated supplementary files.
Pick up the first sample, named 7306054_sample1, in 3 gz files including barcodes, features, and matrix. Then download them to local hard disk.
Downloaded file is a tar file. Copy it to working directory.
2 options for unpacking the tar file, using software in windows or Mac OS, or by R.
untar("GSE231920_RAW.tar",exdir = "input") # unzip the file "", exdir(解压为)= “文件夹名”
dir("input") # dir等于list.file, 相当于ls,列出work directory里的所有文件
Three 10X files are kept in one folder. File names are fixed without prefix:
“barcodes.tsv.gz” for barcodes in DNA library index, each barcode for each single cell theoretically, will be the column name of the expression matrix.
“features.tsv.gz” for gene names, will be the row names of the expression matrix, could be ”genes.tsv.gz”.
“matrix.mtx.gz” for the value, only save non-zero value (save volume for the file).
If there are multiple samples, one folder is for one sample. The sample name can only be the folder name since the three names in each folder are the same.
The data downloaded from GEO is named by sample ID as prefix, so their name need to be changed to standard ones mentioned above.
Renaming mannually or by R codes:
library(stringr)
nn = str_remove(dir("input/"),"GSM7306054_sample1_")
nn
file.rename(paste0("input/",dir("input/")),
paste0("input/",nn))
dir("input/")
Q: The size of file is 0B. Is it correct?
Function for reading data is Read10X.
rm(list = ls()) # 清空右上角的所有变量,方便反复调试代码
library(Seurat)
library(patchwork)
library(tidyverse)
ct = Read10X("input/")
dim(ct)
dim for colunm number and row number.
A sparse matrix is a special case of a matrix in which the number of zero elements is much higher than the number of non-zero elements. As a rule of thumb, if 2/3 of the total elements in a matrix are zeros, it can be called a sparse matrix.
class(ct)
ct[c("CD3D","TCL1A","MS4A1"), 1:30]
Create one Seurat object, and all the following operation is on a Seurat object.
seu.obj <- CreateSeuratObject(counts = ct,
min.cells = 3, # A gene must be expressed in at least three cells to be preserved
min.features = 200) #At least 200 genes have to be expressed in a cell to be preserved
dim(seu.obj) #the dim has changed and the number of rows and columns has decreased
set.seed(1234) # a random seed designed to make subsequent random samples repeatable
seu.obj = subset(seu.obj,downsample = 3000) #, as long as the number in setseed brackets does not change over multiple runs, the sampling result will not change
class(seu.obj)
The output means : this is a Seurat object, original of SeuratObject.
Compare the following two results:
Important tips:
要提取它的子集,有@和$两个符号,或者是从右边点开看,或者是用代码查看。一般第一层次都是@,后面的就不一定了
Two options:
seu.obj@meta.data %>% head()
seu.obj[[]] %>% head()
%>%是向后传递,即:把%>%前面的数据传递给%>%后面的函数,作为该函数的第一个参数。
Control numbers of mitochondrial genes,
and nFeature_RNA.
Explaination: nFeature_RNA是每个细胞中检测到的基因数量。nCount_RNA是细胞内检测到的分子总数。nFeature_RNA过低,表示该细胞可能已死/将死或是空液滴。太高的nCount_RNA和/或nFeature_RNA表明“细胞”实际上可以是两个或多个细胞。结合线粒体基因count数除去异常值,即可除去大多数双峰/死细胞/空液滴,因此它们过滤是常见的预处理步骤。
https://www.biostars.org/p/407036/
pattern = “^MT-”, using regular expression to match MT- prefix genes
seu.obj[["percent.mt"]] <- PercentageFeatureSet(seu.obj, pattern = "^MT-")
head(seu.obj@meta.data)
A new colunm of "percent.mt" was added to meta.data
VlnPlot(seu.obj,
features = c("nFeature_RNA",
"nCount_RNA",
"percent.mt"),
ncol = 3,pt.size = 0.5)
Example of filteration criteria:
seu.obj = subset(seu.obj,
nFeature_RNA < 6000 &
nCount_RNA < 30000 &
percent.mt < 18)
Expalination: copied from huahua's text. 对于表达矩阵来说,一个基因就是一个维度,有几万个维度,太多了,需要减少,称之为降维。
PCA是主成分分析,完成线性的降维,会把几万个维度转换为几十个新的维度,而UMAP和tSNE是非线性的降维,线性降维不够彻底,整点高级的,进一步降维。
UMAP 和tSNE二选一就行,没有必须用哪个的说法,只能说UMAP用的多一些,图一般会紧凑一些。
file.exists #判断一个文件在工作目录下是否存在的函数
f = "obj.Rdata"
if(!file.exists(f)){
seu.obj = seu.obj %>%
NormalizeData() %>%
FindVariableFeatures() %>%
ScaleData(features = rownames(.)) %>%
RunPCA(features = VariableFeatures(.)) %>%
FindNeighbors(dims = 1:15) %>%
FindClusters(resolution = 0.5) %>%
RunUMAP(dims = 1:15) %>%
RunTSNE(dims = 1:15)
save(seu.obj,file = f)
}
load(f)
ElbowPlot(seu.obj)
p1 <- DimPlot(seu.obj, reduction = "umap",label = T)+NoLegend();p1
untar("GSE231920_RAW(1).tar",exdir = "input2")
dir("input2")
library(stringr)
nn = str_remove(dir("input2/"),"GSM7306055_sample2_")
nn
file.rename(paste0("input2/",dir("input2/")),
paste0("input2/",nn))
dir("input2/")
ct = Read10X("input2/")
dim(ct)
class(ct)
seu.obj <- CreateSeuratObject(counts = ct,
min.cells = 3,
min.features = 200)
dim(seu.obj)
seu.obj@meta.data %>% head() # seu.obj[[]] %>% head()
seu.obj[["percent.mt"]] <- PercentageFeatureSet(seu.obj, pattern = "^MT-")
head(seu.obj@meta.data)
VlnPlot(seu.obj,
features = c("nFeature_RNA",
"nCount_RNA",
"percent.mt"),
ncol = 3,pt.size = 0.5)
f = "obj.Rdata"
if(!file.exists(f)){
seu.obj = seu.obj %>%
NormalizeData() %>%
FindVariableFeatures() %>%
ScaleData(features = rownames(.)) %>%
RunPCA(features = VariableFeatures(.)) %>%
FindNeighbors(dims = 1:15) %>%
FindClusters(resolution = 0.5) %>%
RunUMAP(dims = 1:15) %>%
RunTSNE(dims = 1:15)
save(seu.obj,file = f)
}
load(f)
ElbowPlot(seu.obj)
p1 <- DimPlot(seu.obj, reduction = "umap",label = T)+NoLegend();p1
原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。
如有侵权,请联系 cloudcommunity@tencent.com 删除。
原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。
如有侵权,请联系 cloudcommunity@tencent.com 删除。