关于Python和R编程语言的优先级的讨论实在是太多了, 小朋友才做选择,成年人就是全都学!我们有:掌握Python,解锁单细胞数据的无限可能,也有:生信入门&数据挖掘线上直播课12月班,掌握多种编程语言更方便应对各种生物信息学数据分析场合!
官网github链接:https://github.com/saezlab/decoupleR
文献信息:
Badia-i-Mompel P.,。。。。2022. decoupleR: ensemble of computational methods to infer biological activities from omics data. Bioinformatics Advances. https://doi.org/10.1093/bioadv/vbac016
11个被囊括的包如下:
image-20241202163856344
options(BioC_mirror="https://mirrors.westlake.edu.cn/bioconductor")
options("repos"=c(CRAN="https://mirrors.westlake.edu.cn/CRAN/"))
install.packages('remotes')
remotes::install_github('saezlab/decoupleR')
下面来简单看一下其中一个功能:Pathway activity inference in bulk RNA-seq
https://saezlab.github.io/decoupleR/articles/pw_bk.html
在这个笔记本中,我们展示了如何使用decoupleR对一个bulk RNA测序数据集进行通路活性推断,该数据集中胰腺癌细胞系中的转录因子FOXA2被敲除。
数据包括3个野生型(WT)样本和3个敲除型(KO)样本,GEO编号为:https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE119931,是一个表观调控的项目,其中转录组数据部分样品信息如下所示:
作者已经对此数据集进行了预处理,并放到了包中:
# 加载
inputs_dir <- system.file("extdata", package = "decoupleR")
data <- readRDS(file.path(inputs_dir, "bk_data.rds"))
# 简单探索
class(data)
str(data)
可以看到数据为一个list对象,包含三个数据:counts值(为the normalized log-transformed counts)、design分组矩阵、limma_ttop差异基因结果
List of 3
$ counts : spc_tbl_ [11,093 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
..$ gene : chr [1:11093] "NOC2L" "PLEKHN1" "PERM1" "ISG15" ...
..$ PANC1.WT.Rep1 : num [1:11093] 10.05 7.54 6.28 10.94 6.96 ...
..$ PANC1.WT.Rep2 : num [1:11093] 11.95 8.13 6.42 11.47 7.2 ...
..$ PANC1.WT.Rep3 : num [1:11093] 12.06 8.71 6.59 11.43 7.52 ...
..$ PANC1.FOXA2KO.Rep1: num [1:11093] 12.31 8.05 6.29 11.55 7.06 ...
..$ PANC1.FOXA2KO.Rep2: num [1:11093] 12.14 8.29 6.49 11.37 7.49 ...
..$ PANC1.FOXA2KO.Rep3: num [1:11093] 11.49 8.62 6.78 11.18 7.07 ...
..- attr(*, "spec")=
.. .. cols(
.. .. gene = col_character(),
.. .. PANC1.WT.Rep1 = col_double(),
.. .. PANC1.WT.Rep2 = col_double(),
.. .. PANC1.WT.Rep3 = col_double(),
.. .. PANC1.FOXA2KO.Rep1 = col_double(),
.. .. PANC1.FOXA2KO.Rep2 = col_double(),
.. .. PANC1.FOXA2KO.Rep3 = col_double()
.. .. )
$ design : spc_tbl_ [6 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
..$ sample : chr [1:6] "PANC1.WT.Rep1" "PANC1.WT.Rep2" "PANC1.WT.Rep3" "PANC1.FOXA2KO.Rep1" ...
..$ condition: chr [1:6] "PANC1.WT" "PANC1.WT" "PANC1.WT" "PANC1.FOXA2KO" ...
..- attr(*, "spec")=
.. .. cols(
.. .. sample = col_character(),
.. .. condition = col_character()
.. .. )
$ limma_ttop: spc_tbl_ [11,093 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
..$ ID : chr [1:11093] "RHBDL2" "PLEKHH2" "HEG1" "CLU" ...
..$ logFC : num [1:11093] -1.82 -1.57 -1.73 -1.79 2.09 ...
..$ AveExpr : num [1:11093] 8.6 7.7 8.64 12.22 9.61 ...
..$ t : num [1:11093] -12.81 -10.79 -9.79 -9.76 8.95 ...
..$ P.Value : num [1:11093] 0.00000303 0.00000993 0.0000194 0.00001976 0.00003552 ...
..$ adj.P.Val: num [1:11093] 0.0336 0.0548 0.0548 0.0548 0.0788 ...
..$ B : num [1:11093] 3.95 3.27 2.84 2.83 2.43 ...
..- attr(*, "spec")=
.. .. cols(
.. .. ID = col_character(),
.. .. logFC = col_double(),
.. .. AveExpr = col_double(),
.. .. t = col_double(),
.. .. P.Value = col_double(),
.. .. adj.P.Val = col_double(),
.. .. B = col_double()
.. .. )
提取count:
# Remove NAs and set row names
counts <- data$counts %>%
dplyr::mutate_if(~ any(is.na(.x)),
~ dplyr::if_else(is.na(.x), 0, .x)) %>%
tibble::column_to_rownames(var = "gene") %>%
as.matrix()
head(counts)
提取The design meta-data:
design <- data$design
design
提取limma结果:会用到结果中的t值
# Extract t-values per gene
deg <- data$limma_ttop %>%
dplyr::select(ID, t) %>%
dplyr::filter(!is.na(t)) %>%
tibble::column_to_rownames(var = "ID") %>%
as.matrix()
head(deg)
访问网址:https://saezlab.github.io/progeny/
PROGENy是一个包含了经过整理的通路及其靶基因的集合,并且为每个相互作用提供了权重。在这个例子中,我们将使用人类权重(也提供了其他生物体的权重),并且我们将使用按p值排名的前500个responsive genes。以下是每个通路的简要描述:
获取:
# 这里需要安装一下OmnipathR包
# BiocManager::install("OmnipathR")
net <- decoupleR::get_progeny(organism = 'human', top = 500)
net
head(net)
table(net$source)
# 保存一下,这里不是很好下载,后面看看是不是有什么地方可以加速这个下载
saveRDS(net,file = "net.rds")
活性推断采用的模型为:Multivariate Linear Model (MLM),MLM在文章中方法的可靠性排前三。推断的打分如果为positive,表示 the pathway is active;如果是negative,则inactive。
# Run mlm
sample_acts <- decoupleR::run_mlm(mat = counts,
net = net,
.source = 'source',
.target = 'target',
.mor = 'weight',
minsize = 5)
sample_acts
结果如下:
首先提取每个样本中每条通路的score:
# Transform to wide matrix
sample_acts_mat <- sample_acts %>%
tidyr::pivot_wider(id_cols = 'condition',
names_from = 'source',
values_from = 'score') %>%
tibble::column_to_rownames('condition') %>%
as.matrix()
sample_acts_mat
# Scale per feature
sample_acts_mat <- scale(sample_acts_mat)
打分矩阵:
绘图:
# Color scale
colors <- rev(RColorBrewer::brewer.pal(n = 11, name = "RdBu"))
colors.use <- grDevices::colorRampPalette(colors = colors)(100)
my_breaks <- c(seq(-2, 0, length.out = ceiling(100 / 2) + 1),
seq(0.05,2, length.out = floor(100 / 2)))
# Plot
pheatmap::pheatmap(mat = sample_acts_mat,
color = colors.use,
border_color = "white",
breaks = my_breaks,
cellwidth = 20,
cellheight = 20,
treeheight_row = 20,
treeheight_col = 20)
打分结果:
# Run mlm
contrast_acts <- decoupleR::run_mlm(mat =deg,
net = net,
.source = 'source',
.target = 'target',
.mor = 'weight',
minsize = 5)
contrast_acts
# 可视化
colors <- rev(RColorBrewer::brewer.pal(n = 11, name = "RdBu")[c(2, 10)])
p <- ggplot2::ggplot(data = contrast_acts,
mapping = ggplot2::aes(x = stats::reorder(source, score),
y = score)) +
ggplot2::geom_bar(mapping = ggplot2::aes(fill = score),
color = "black",
stat = "identity") +
ggplot2::scale_fill_gradient2(low = colors[1],
mid = "whitesmoke",
high = colors[2],
midpoint = 0) +
ggplot2::theme_minimal() +
ggplot2::theme(axis.title = element_text(face = "bold", size = 12),
axis.text.x = ggplot2::element_text(angle = 45,
hjust = 1,
size = 10,
face = "bold"),
axis.text.y = ggplot2::element_text(size = 10,
face = "bold"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
ggplot2::xlab("Pathways")
p
结果:与野生型(WT)相比,敲除型(KO)中的p53和TRAIL通路被抑制,而MAPKK和JAK-STAT通路似乎被激活,与上面的每个样本中活性打分热图基本一致。
我们可以进一步沿着它们的t值可视化每个通路中最有反应的基因,以解释结果。例如,让我们来看看属于MAPK通路的基因:
pathway <- 'MAPK'
df <- net %>%
dplyr::filter(source == pathway) %>%
dplyr::arrange(target) %>%
dplyr::mutate(ID = target,
color = "3") %>%
tibble::column_to_rownames('target')
inter <- sort(dplyr::intersect(rownames(deg), rownames(df)))
df <- df[inter, ]
df['t_value'] <- deg[inter, ]
df <- df %>%
dplyr::mutate(color = dplyr::if_else(weight > 0 & t_value > 0, '1', color)) %>%
dplyr::mutate(color = dplyr::if_else(weight > 0 & t_value < 0, '2', color)) %>%
dplyr::mutate(color = dplyr::if_else(weight < 0 & t_value > 0, '2', color)) %>%
dplyr::mutate(color = dplyr::if_else(weight < 0 & t_value < 0, '1', color))
colors <- rev(RColorBrewer::brewer.pal(n = 11, name = "RdBu")[c(2, 10)])
p <- ggplot2::ggplot(data = df,
mapping = ggplot2::aes(x = weight,
y = t_value,
color = color)) +
ggplot2::geom_point(size = 2.5,
color = "black") +
ggplot2::geom_point(size = 1.5) +
ggplot2::scale_colour_manual(values = c(colors[2], colors[1], "grey")) +
ggrepel::geom_label_repel(mapping = ggplot2::aes(label = ID)) +
ggplot2::theme_minimal() +
ggplot2::theme(legend.position = "none") +
ggplot2::geom_vline(xintercept = 0, linetype = 'dotted') +
ggplot2::geom_hline(yintercept = 0, linetype = 'dotted') +
ggplot2::ggtitle(pathway)
p
结果如下:
可以去试试看~
前面我们介绍的GSE119931数据集里面有两个不同的细胞系实验,都是两分组,就是FOXA2-KO和WT的差异分析,大家试试看读取 https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE119931 里面的表达量矩阵:
GSM3387462 PANC1.WT.Rep1
GSM3387463 PANC1.WT.Rep2
GSM3387464 PANC1.WT.Rep3
GSM3387465 PANC1.FOXA2-KO.Rep1
GSM3387466 PANC1.FOXA2-KO.Rep2
GSM3387467 PANC1.FOXA2-KO.Rep3
GSM3387468 CFPAC1.WT.Rep1
GSM3387469 CFPAC1.WT.Rep2
GSM3387470 CFPAC1.WT.Rep3
GSM3387474 CFPAC1.FOXA2-KD.Rep1
GSM3387475 CFPAC1.FOXA2-KD.Rep2
GSM3387476 CFPAC1.FOXA2-KD.Rep3
找到了GSE119931_raw_counts_GRCh38.p13_NCBI.tsv.gz,这个2022-12-15的749.7 Kb的文件,然后进行两次差异分析,然后做交集看看,韦恩图或者更多丰富的方法, 也可以去看看文献:FOXA2 controls the cis-regulatory networks of pancreatic cancer cells in a differentiation grade-specific manner. EMBO J 2019 Oct 做更丰富的生物学解读!