我们生信马拉松授课群里有个学员问空间转录组数据如何读取,给出的数据编号为 GSE190811,由于不是10x spaceranger 的标准格式,对于初学者来说有一些难度,下面一起来看看这个数据集!
GSE190811 这个数据来自2022年11月10号发表在NC杂志上,标题为《Single cell profiling of primary and paired metastatic lymph node tumors in breast cancer patients》,文章仅用这个数据绘制了一幅图:上皮细胞的特征基因在空间切片上的打分分布。
这个数据的GEO链接为:https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE190811,总共是4个空间转录组的样本:
GSM5732357 PT_3 LNM ST
GSM5732358 PT_6 LNM ST
GSM5732359 PT_7 LNM ST
GSM5732360 PT_8 LNM ST
提供的附件 GSE190811_RAW.tar 包含的文件如下:
GSM5732357_A_1938345_11.jpg.gz 52.6 Mb
GSM5732357_A_raw_feature_bc_matrix.h5 29.6 Mb
GSM5732357_A_scalefactors_json.json.gz 161 b
GSM5732357_A_tissue_hires_image.png.gz 14.5 Mb
GSM5732357_A_tissue_positions_list.csv.gz 66.5 Kb
GSM5732358_B_1938529_9.jpg.gz 68.6 Mb
GSM5732358_B_raw_feature_bc_matrix.h5 34.7 Mb
GSM5732358_B_scalefactors_json.json.gz 164 b
GSM5732358_B_tissue_hires_image.png.gz 16.9 Mb
GSM5732358_B_tissue_positions_list.csv.gz 64.4 Kb
GSM5732359_C_2000752_23.jpg.gz 75.1 Mb
GSM5732359_C_raw_feature_bc_matrix.h5 40.5 Mb
GSM5732359_C_scalefactors_json.json.gz 164 b
GSM5732359_C_tissue_hires_image.png.gz 17.3 Mb
GSM5732359_C_tissue_positions_list.csv.gz 62.6 Kb
GSM5732360_D_2000910_33.jpg.gz 65.7 Mb
GSM5732360_D_raw_feature_bc_matrix.h5 38.5 Mb
GSM5732360_D_scalefactors_json.json.gz 160 b
GSM5732360_D_tissue_hires_image.png.gz 16.1 Mb
GSM5732360_D_tissue_positions_list.csv.gz 59.7 Kb
{"tissue_hires_scalef": 0.06628223, "tissue_lowres_scalef": 0.019884668, "fiducial_diameter_fullres": 383.95734, "spot_diameter_fullres": 237.68787000000003}
下载下来后整理成如下格式:
├── GSM5732357_A
│ ├── 1938345_11.jpg
│ ├── raw_feature_bc_matrix.h5
│ ├── scalefactors_json.json
│ ├── tissue_hires_image.png
│ └── tissue_positions_list.csv
├── GSM5732358_B
│ ├── 1938529_9.jpg
│ ├── raw_feature_bc_matrix.h5
│ ├── scalefactors_json.json
│ ├── tissue_hires_image.png
│ └── tissue_positions_list.csv
├── GSM5732359_C
│ ├── 2000752_23.jpg
│ ├── raw_feature_bc_matrix.h5
│ ├── scalefactors_json.json
│ ├── tissue_hires_image.png
│ └── tissue_positions_list.csv
├── GSM5732360_D
│ ├── 2000910_33.jpg
│ ├── raw_feature_bc_matrix.h5
│ ├── scalefactors_json.json
│ ├── tissue_hires_image.png
│ └── tissue_positions_list.csv
这里用到的文件为 raw_feature_bc_matrix.h5,以及 图片tissue_hires_image.png:
###
### Create: Jianming Zeng
### Date: 2023-12-31
### Email: jmzeng1314@163.com
### Blog: http://www.bio-info-trainee.com/
### Forum: http://www.biotrainee.com/thread-1376-1-1.html
### CAFS/SUSTC/Eli Lilly/University of Macau
### Update Log: 2023-12-31 First version
### Update Log: 2024-12-09 by juan zhang (492482942@qq.com)
###
rm(list=ls())
options(stringsAsFactors = F)
library(ggsci)
library(dplyr)
library(future)
library(Seurat)
library(clustree)
library(cowplot)
library(data.table)
library(ggplot2)
library(patchwork)
library(stringr)
library(qs)
library(Matrix)
getwd()
# 创建目录
getwd()
gse <- "GSE190811"
dir.create(gse)
################## 读取数据,有h5与tissue_hires_image.png
samples <- list.dirs("GSE190811/", recursive = F, full.names = F)
samples
scRNAlist <- lapply(samples, function(pro){
# pro <- samples[1]
print(pro)
# 先读取 h5
data <- Read10X_h5(filename = paste0("GSE190811/",pro,"/raw_feature_bc_matrix.h5"))
dim(data)
data[1:5,1:5]
object <- CreateSeuratObject(counts = data, assay = "Spatial", min.cells = 3, project = pro)
object
# 再读取
image <- Read10X_Image(image.dir = paste0("GSE190811/",pro),
image.name = "tissue_hires_image.png",
filter.matrix = TRUE)
image
dim(image)
image <- image[Cells(x = object)]
DefaultAssay(object = image) <- "Spatial"
# 添加图片到object中
object[[pro]] <- image
object@images[[1]]@scale.factors$lowres <- object@images[[1]]@scale.factors$hires
return(object)
})
names(scRNAlist) <- samples
scRNAlist
# merge
sce.all <- merge(scRNAlist[[1]], y=scRNAlist[-1], add.cell.ids=samples)
sce.all <- JoinLayers(sce.all) # seurat v5
sce.all
## 简单探索一下数据结构
as.data.frame(sce.all[["Spatial"]]$counts[1:4,1:4])
as.data.frame(LayerData(sce.all, assay = "Spatial", layer = "counts")[1:5,1:5])
head(sce.all@meta.data)
table(sce.all$orig.ident)
Layers(sce.all)
Assays(sce.all)
library(qs)
qsave(sce.all, file="GSE190811/sce.all.qs")
每个样本中 nCount_Spatial
和nFeature_Spatial
的小提琴分布:
# Data preprocessing
plot1 <- VlnPlot(sce.all, features = c("nCount_Spatial","nFeature_Spatial"), pt.size = 0.1) + NoLegend()
plot1
ggsave(filename = "1-QC/VlnPlot.pdf", width = 12,height = 6, plot = plot1)
空间分布特征模式:nCount_Spatial
plot2 <- SpatialFeaturePlot(sce.all, features = "nCount_Spatial",pt.size.factor = 2)
ggsave(filename = "1-QC/FeaturePlot_nCount.pdf", width = 12,height = 6, plot = plot2)
nFeature_Spatial
分布:
plot3 <- SpatialFeaturePlot(sce.all, features = "nFeature_Spatial",pt.size.factor = 2)
ggsave(filename = "1-QC/FeaturePlot_nFeature.pdf", width = 12,height = 6, plot = plot3)
降维聚类分群:
## 计算线粒体/核糖体/特定基因集的比例
mito_genes <- grep("^MT-", rownames(sce.all),ignore.case = T, value = T)
# 可能是13个线粒体基因
print(mito_genes)
sce.all <- PercentageFeatureSet(sce.all, features = mito_genes, col.name = "percent_mito")
fivenum(sce.all@meta.data$percent_mito)
# 另外一种方法
# sce.all$percent_mito <- (Matrix::colSums(sce.all[["Spatial"]]$counts[mt.genes, ])/Matrix::colSums(sce.all[["Spatial"]]$counts))*100
# fivenum(sce.all$percent_mito)
## 过滤spot
sce.all <- subset(sce.all, subset = nFeature_Spatial > 200 )
## 标准化
options(future.globals.maxSize= 5*1024*1024^2)
sce.all <- SCTransform(sce.all, assay = "Spatial", verbose = T)
sce.all <- RunPCA(sce.all, assay = "SCT", verbose = FALSE)
sce.all <- FindNeighbors(sce.all, reduction = "pca", dims = 1:30)
sce.all <- FindClusters(sce.all, verbose = FALSE,resolution = 0.8)
sce.all <- RunUMAP(sce.all, reduction = "pca", dims = 1:30)
p1 <- DimPlot(sce.all, reduction = "umap", label = TRUE,label.size = 7)
p2 <- SpatialDimPlot(sce.all, label = TRUE, label.size = 3,pt.size.factor = 2,image.alpha = 0.6)
p <- p1 + p2
p
ggsave(filename = "1-QC/DimPlot.pdf", width = 12,height = 6, plot = p)
降维聚类结果如下:
上面的代码运行过程中踩了两个坑,对应的报错以及github上面给出的解决方案如下:
踩坑1:矩阵在log后含NA值,https://github.com/satijalab/sctransform/issues/86,我这里 设置了 CreateSeuratObject(counts = data, assay = "Spatial", min.cells = 3, project = pro)的时候min.cells过滤掉在所有细胞中表达都为0的基因。
Running SCTransform on assay: Spatial
Running SCTransform on layer: counts
vst.flavor='v2' set. Using model with fixed slope and excluding poisson genes.
Error in make_cell_attr(umi, cell_attr, latent_var, batch_var, latent_var_nonreg, :
cell attribute "log_umi" contains NA, NaN, or infinite value
踩坑2:运行内存不足 https://github.com/satijalab/seurat/discussions/9191
Error in getGlobalsAndPackages(expr, envir = envir, globals = globals) :
The total size of the 19 globals exported for future expression (‘FUN()’) is 555.16 MiB.. This exceeds the maximum allowed size of 500.00 MiB (option 'future.globals.maxSize'). The three largest globals are ‘FUN’ (538.70 MiB of class ‘function’), ‘umi_bin’ (16.12 MiB of class ‘numeric’) and ‘data_step1’ (270.82 KiB of class ‘list’)
到这里简单的基础分析就做完了~
扫码关注腾讯云开发者
领取腾讯云代金券
Copyright © 2013 - 2025 Tencent Cloud. All Rights Reserved. 腾讯云 版权所有
深圳市腾讯计算机系统有限公司 ICP备案/许可证号:粤B2-20090059 深公网安备号 44030502008569
腾讯云计算(北京)有限责任公司 京ICP证150476号 | 京ICP备11018762号 | 京公网安备号11010802020287
Copyright © 2013 - 2025 Tencent Cloud.
All Rights Reserved. 腾讯云 版权所有