0.单细胞的TPM数据
众所周知,正常的单细胞数据应该给我们提供count,但偏偏有一些数据是特殊的,给的是tpm数据。准确来说应该是log2(tpm/10+1)
为啥除以10,这么解释的: TPM values were divided by 10 since we estimate the complexity of our single cell libraries to be on the order of 100,000 transcripts and would like to avoid counting each transcript ~10 times, as would be the case with TPM, which may inflate the difference between the expression level of a gene in cells in which the gene is detected and those in which it is not detected.
查阅seurat的github,对tpm数据的说明是
If you have TPM data, you can simply manually log transform the gene expression matrix in the object@data slot before scaling the data. You have to replace your object@data slot with the desired gene expression matrix as follows: sce.all@data = log(x = norm + 1))
在Seurat v5 版本之前,直接跳过NormalizeData 这个函数即可:
Yes, run CreateSeuratObject() and skip the data normalization step since you’re starting with normalized data
但是呢,当我们使用Seurat v5 ,粗暴的跳过NormalizeData 是会报错的。例如:
dat = data.table::fread("GSE72056_melanoma_single_cell_revised_v2.txt.gz",data.table = F)
dat[1:4,1:4]
## Cell
## 1 tumor
## 2 malignant(1=no,2=yes,0=unresolved)
## 3 non-malignant cell type (1=T,2=B,3=Macro.4=Endo.,5=CAF;6=NK)
## 4 C9orf152
## Cy72_CD45_H02_S758_comb CY58_1_CD45_B02_S974_comb Cy71_CD45_D08_S524_comb
## 1 72 58 71
## 2 1 1 2
## 3 2 1 0
## 4 0 0 0
dat = dat[-(1:3),]
library(tidyverse)
dat = distinct(dat,Cell,.keep_all = T)
library(tibble)
dat = column_to_rownames(dat,"Cell")
dat[1:4,1:4]
## Cy72_CD45_H02_S758_comb CY58_1_CD45_B02_S974_comb
## C9orf152 0.0000 0.0000
## RPS11 9.2172 8.3745
## ELMO2 0.0000 0.0000
## CREB3L1 0.0000 0.0000
## Cy71_CD45_D08_S524_comb Cy81_FNA_CD45_B01_S301_comb
## C9orf152 0.0000 0.0000
## RPS11 9.3130 7.8876
## ELMO2 2.1263 0.0000
## CREB3L1 0.0000 0.0000
library(Seurat)
sce.all <- CreateSeuratObject(counts = dat,
project = "a",
min.cells = 3,
min.features = 200)
sce.all[["percent.mt"]] <- PercentageFeatureSet(sce.all, pattern = "^MT-")
head(sce.all@meta.data, 3)
## orig.ident nCount_RNA nFeature_RNA percent.mt
## Cy72_CD45_H02_S758_comb a 7143.363 3365 0
## CY58_1_CD45_B02_S974_comb a 8915.833 3637 0
## Cy71_CD45_D08_S524_comb a 12578.256 4660 0
VlnPlot(sce.all,
features = c("nFeature_RNA",
"nCount_RNA",
"percent.mt"),
ncol = 3,pt.size = 0.5)
image.png
线粒体基因比例为0,另两个图也没有很利群的值,这个数据可以不过滤。
当我们企图跳过NormalizeData就会报错
sce.all = sce.all %>%
#NormalizeData() %>%
FindVariableFeatures() %>%
ScaleData()
## Error in `ScaleData()`:
## ! No layer matching pattern 'data' found. Please run NormalizeData and retry
明明4版本这样做是木有问题的嘞。
各种细枝末节都可以在seurat github页面上找到答案:
In the previous versions, “data” was automatically populated with the raw counts prior to running NormalizeData, which is no longer the case in v5. Therefore, if you run ScaleData and the “data” layer does not exist, this will not work. If you want to run ScaleData on the raw counts (which I wouldn’t recommend doing in most cases), you could do the following: sce.all_small <- ScaleData(sce.all_small, layer = “counts”) https://github.com/satijalab/seurat/issues/8003
所以解决办法就是加上个参数layer = “counts”。虽然叫counts但我们创建对象的时候放了个什么东西进去自己心里没点数嘛?那其实是tpm😄。
sce.all = sce.all %>%
#NormalizeData() %>%
FindVariableFeatures() %>%
ScaleData(layer = "counts") %>%
RunPCA(pc.genes = pbmc@var.genes) %>%
FindNeighbors(dims = 1:15) %>%
FindClusters(resolution = 0.5) %>%
RunUMAP(dims = 1:15)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 4645
## Number of edges: 150415
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9164
## Number of communities: 16
## Elapsed time: 0 seconds
后面就可以正常进行啦,顺便把这个数据做完
# 注释
library(celldex)
library(SingleR)
#ref <- celldex::HumanPrimaryCellAtlasData()
#因为下载太慢,使用了提前存好的本地数据
ref <- get(load("ref_Human_all.RData"))
library(BiocParallel)
pred.scRNA <- SingleR(test = sce.all@assays$RNA$counts,
ref = ref,
labels = ref$label.main,
clusters = sce.all@active.ident)
pred.scRNA$pruned.labels
## [1] "T_cells" "T_cells" "B_cell"
## [4] "Neurons" "Neurons" "T_cells"
## [7] "Monocyte" "T_cells" "Tissue_stem_cells"
## [10] "Neurons" "Neurons" "Endothelial_cells"
## [13] "T_cells" "Fibroblasts" "T_cells"
## [16] "Neurons"
plotScoreHeatmap(pred.scRNA, clusters=pred.scRNA@rownames, fontsize.row = 9,show_colnames = T)
new.cluster.ids <- pred.scRNA$pruned.labels
names(new.cluster.ids) <- levels(sce.all)
levels(sce.all)
## [1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14"
## [16] "15"
sce.all <- RenameIdents(sce.all,new.cluster.ids)
levels(sce.all)
## [1] "T_cells" "B_cell" "Neurons"
## [4] "Monocyte" "Tissue_stem_cells" "Endothelial_cells"
## [7] "Fibroblasts"
p = UMAPPlot(object = sce.all, pt.size = 0.5, label = TRUE)
p