前往小程序,Get更优阅读体验!
立即前往
发布
社区首页 >专栏 >单细胞TPM数据遇上Seurat v5

单细胞TPM数据遇上Seurat v5

作者头像
用户11414625
发布2024-12-20 15:02:33
发布2024-12-20 15:02:33
9700
代码可运行
举报
文章被收录于专栏:生信星球520生信星球520
运行总次数:0
代码可运行

0.单细胞的TPM数据

众所周知,正常的单细胞数据应该给我们提供count,但偏偏有一些数据是特殊的,给的是tpm数据。准确来说应该是log2(tpm/10+1)

为啥除以10,这么解释的: TPM values were divided by 10 since we estimate the complexity of our single cell libraries to be on the order of 100,000 transcripts and would like to avoid counting each transcript ~10 times, as would be the case with TPM, which may inflate the difference between the expression level of a gene in cells in which the gene is detected and those in which it is not detected.

查阅seurat的github,对tpm数据的说明是

If you have TPM data, you can simply manually log transform the gene expression matrix in the object@data slot before scaling the data. You have to replace your object@data slot with the desired gene expression matrix as follows: sce.all@data = log(x = norm + 1))

在Seurat v5 版本之前,直接跳过NormalizeData 这个函数即可:

Yes, run CreateSeuratObject() and skip the data normalization step since you’re starting with normalized data

但是呢,当我们使用Seurat v5 ,粗暴的跳过NormalizeData 是会报错的。例如:

1.读取数据
代码语言:javascript
代码运行次数:0
复制
dat = data.table::fread("GSE72056_melanoma_single_cell_revised_v2.txt.gz",data.table = F)
dat[1:4,1:4]

##                                                           Cell
## 1                                                        tumor
## 2                           malignant(1=no,2=yes,0=unresolved)
## 3 non-malignant cell type (1=T,2=B,3=Macro.4=Endo.,5=CAF;6=NK)
## 4                                                     C9orf152
##   Cy72_CD45_H02_S758_comb CY58_1_CD45_B02_S974_comb Cy71_CD45_D08_S524_comb
## 1                      72                        58                      71
## 2                       1                         1                       2
## 3                       2                         1                       0
## 4                       0                         0                       0

dat = dat[-(1:3),]
library(tidyverse)
dat = distinct(dat,Cell,.keep_all = T)
library(tibble)
dat = column_to_rownames(dat,"Cell")
dat[1:4,1:4]

##          Cy72_CD45_H02_S758_comb CY58_1_CD45_B02_S974_comb
## C9orf152                  0.0000                    0.0000
## RPS11                     9.2172                    8.3745
## ELMO2                     0.0000                    0.0000
## CREB3L1                   0.0000                    0.0000
##          Cy71_CD45_D08_S524_comb Cy81_FNA_CD45_B01_S301_comb
## C9orf152                  0.0000                      0.0000
## RPS11                     9.3130                      7.8876
## ELMO2                     2.1263                      0.0000
## CREB3L1                   0.0000                      0.0000
2.正常创建对象
代码语言:javascript
代码运行次数:0
复制
library(Seurat)
sce.all <- CreateSeuratObject(counts = dat, 
                           project = "a", 
                           min.cells = 3, 
                           min.features = 200)
3.质控
代码语言:javascript
代码运行次数:0
复制
sce.all[["percent.mt"]] <- PercentageFeatureSet(sce.all, pattern = "^MT-")
head(sce.all@meta.data, 3)

##                           orig.ident nCount_RNA nFeature_RNA percent.mt
## Cy72_CD45_H02_S758_comb            a   7143.363         3365          0
## CY58_1_CD45_B02_S974_comb          a   8915.833         3637          0
## Cy71_CD45_D08_S524_comb            a  12578.256         4660          0

VlnPlot(sce.all, 
        features = c("nFeature_RNA",
                     "nCount_RNA", 
                     "percent.mt"), 
        ncol = 3,pt.size = 0.5)

image.png

线粒体基因比例为0,另两个图也没有很利群的值,这个数据可以不过滤。

4.企图跳过NormalizeData

当我们企图跳过NormalizeData就会报错

代码语言:javascript
代码运行次数:0
复制
sce.all = sce.all %>% 
  #NormalizeData() %>%  
  FindVariableFeatures() %>%  
  ScaleData()

## Error in `ScaleData()`:
## ! No layer matching pattern 'data' found. Please run NormalizeData and retry

明明4版本这样做是木有问题的嘞。

5.查呀!

各种细枝末节都可以在seurat github页面上找到答案:

In the previous versions, “data” was automatically populated with the raw counts prior to running NormalizeData, which is no longer the case in v5. Therefore, if you run ScaleData and the “data” layer does not exist, this will not work. If you want to run ScaleData on the raw counts (which I wouldn’t recommend doing in most cases), you could do the following: sce.all_small <- ScaleData(sce.all_small, layer = “counts”) https://github.com/satijalab/seurat/issues/8003

所以解决办法就是加上个参数layer = “counts”。虽然叫counts但我们创建对象的时候放了个什么东西进去自己心里没点数嘛?那其实是tpm😄。

代码语言:javascript
代码运行次数:0
复制
sce.all = sce.all %>% 
  #NormalizeData() %>%  
  FindVariableFeatures() %>%  
  ScaleData(layer = "counts") %>%
  RunPCA(pc.genes = pbmc@var.genes)  %>%
  FindNeighbors(dims = 1:15) %>% 
  FindClusters(resolution = 0.5) %>% 
  RunUMAP(dims = 1:15) 

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 4645
## Number of edges: 150415
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9164
## Number of communities: 16
## Elapsed time: 0 seconds

后面就可以正常进行啦,顺便把这个数据做完

6.还能做个singleR
代码语言:javascript
代码运行次数:0
复制
# 注释
library(celldex)
library(SingleR)
#ref <- celldex::HumanPrimaryCellAtlasData()
#因为下载太慢,使用了提前存好的本地数据
ref <- get(load("ref_Human_all.RData"))
library(BiocParallel)
pred.scRNA <- SingleR(test = sce.all@assays$RNA$counts, 
                      ref = ref,
                      labels = ref$label.main, 
                      clusters = sce.all@active.ident)
pred.scRNA$pruned.labels

##  [1] "T_cells"           "T_cells"           "B_cell"           
##  [4] "Neurons"           "Neurons"           "T_cells"          
##  [7] "Monocyte"          "T_cells"           "Tissue_stem_cells"
## [10] "Neurons"           "Neurons"           "Endothelial_cells"
## [13] "T_cells"           "Fibroblasts"       "T_cells"          
## [16] "Neurons"

plotScoreHeatmap(pred.scRNA, clusters=pred.scRNA@rownames, fontsize.row = 9,show_colnames = T)
代码语言:javascript
代码运行次数:0
复制
new.cluster.ids <- pred.scRNA$pruned.labels
names(new.cluster.ids) <- levels(sce.all)
levels(sce.all)

##  [1] "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14"
## [16] "15"

sce.all <- RenameIdents(sce.all,new.cluster.ids)
levels(sce.all)

## [1] "T_cells"           "B_cell"            "Neurons"          
## [4] "Monocyte"          "Tissue_stem_cells" "Endothelial_cells"
## [7] "Fibroblasts"

p = UMAPPlot(object = sce.all, pt.size = 0.5, label = TRUE)
p
本文参与 腾讯云自媒体同步曝光计划,分享自微信公众号。
原始发表:2024-04-06,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 生信星球 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体同步曝光计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 1.读取数据
  • 2.正常创建对象
  • 3.质控
  • 4.企图跳过NormalizeData
  • 5.查呀!
  • 6.还能做个singleR
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档