Preface: why do we need clustering analysis?
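When we want to know whether a sample's overall V, D and J gene usage frequencies are associated with other sample characteristics (group, sex, before versus after treatment, disease activity and other clinical features), we need to run a clustering analysis on the samples. Whether the input is V/D/J usage frequency or another immune repertoire feature, different clustering methods let us explore how these repertoire features relate to the samples. If the rows of the data matrix are a set of features (IGH gene names, clonotype sequences or CDR3 amino acid lengths) and the columns are samples or sample-level attributes, the matrix has the same shape as a transcriptome gene expression matrix, and the same tools apply: PCA shows how much sample groups differ, a clustered heatmap of differentially expressed genes shows how similar samples are, and in single-cell transcriptomics, cell clustering is the prerequisite for annotating cell subpopulations. Clustering methods are therefore widely used across bioinformatics.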
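At first I thought PCA was all I needed; only when I tried to learn more about clustering did I discover how many methods exist. Describing every one of them in detail would stray from the purpose of this article, which is why I hesitated for a long time before writing. So I take a simple route and use the advanced gene usage analysis in immunarch, an R package for immune repertoire analysis, as the running example. Its clustering methods based on V/D/J usage frequency are fairly comprehensive and easy to apply in practice.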
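1. Similarity analysis
"js" (Jensen-Shannon divergence). Function: measures the difference between two gene usage frequency distributions; suited to comparing repertoire similarity between samples or groups. Application: identifying V/J gene usage patterns that differ markedly between disease and healthy groups.
"cor" (correlation). Function: computes the Pearson, Kendall or Spearman correlation of gene usage frequencies (chosen with .cor), showing linear (Pearson) or rank-based (Spearman, Kendall) association between samples. Application: analysing dynamic changes in the repertoire across time points or before and after treatment.
"cosine" (cosine similarity). Function: measures similarity as the angle between two vectors; suited to high-dimensional sparse data such as low-frequency clonotypes. Application: comparing the clonal composition of tumour-infiltrating T cells with that of peripheral blood T cells.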
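To make the three measures concrete, here is a minimal base R sketch on two toy gene usage vectors. The frequencies and the helper names kl_div, js_div and cosine_sim are made up for illustration; this is not immunarch's internal implementation, only the textbook definitions.

```r
# Toy gene usage frequencies for two samples (hypothetical values, each sums to 1)
p <- c(TRBV1 = 0.5, TRBV2 = 0.3, TRBV3 = 0.2)
q <- c(TRBV1 = 0.2, TRBV2 = 0.3, TRBV3 = 0.5)

# Kullback-Leibler divergence; entries with a == 0 contribute nothing
kl_div <- function(a, b, base = 2) {
  keep <- a > 0
  sum(a[keep] * log(a[keep] / b[keep], base = base))
}

# Jensen-Shannon divergence: average KL divergence of a and b to their mixture m
js_div <- function(a, b, base = 2) {
  m <- (a + b) / 2
  0.5 * kl_div(a, m, base) + 0.5 * kl_div(b, m, base)
}

# Cosine similarity: dot product divided by the product of the vector norms
cosine_sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

js_pq  <- js_div(p, q)                    # 0 means identical distributions
cor_pq <- cor(p, q, method = "pearson")   # negative here: usage is reversed
cos_pq <- cosine_sim(p, q)
```

With base 2 the Jensen-Shannon divergence lies between 0 and 1, which is why a logarithm base appears as a parameter in the function.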
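2. Dimensionality reduction
"pca" (principal component analysis). Function: reduces dimensionality and extracts the main directions of variation, which suppresses noise. Application: visualising the overall structure of repertoire data from many samples, for example to separate disease subtypes.
"mds" (multi-dimensional scaling). Function: embeds samples in a low-dimensional space from a distance matrix while preserving the overall pairwise distances, so it works with any dissimilarity measure, not only Euclidean distance. Application: showing overall repertoire differences between individuals (for example immune reconstitution after transplantation).
"tsne" (t-SNE). Function: non-linear dimensionality reduction that emphasises local cluster structure; suited to visualising high-dimensional data. Application: revealing the distribution of rare clonotypes or small subpopulations.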
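The following base R sketch shows PCA and classical MDS on a small random usage matrix. The matrix, its dimensions and the gene names are hypothetical, and the calls are plain stats functions rather than immunarch's wrappers. t-SNE is left out because it needs an additional package such as Rtsne.

```r
set.seed(1)

# Hypothetical usage matrix: 6 samples (rows) x 5 genes (columns),
# each row normalised so that the frequencies sum to 1
usage <- matrix(runif(30), nrow = 6,
                dimnames = list(paste0("S", 1:6), paste0("TRBV", 1:5)))
usage <- usage / rowSums(usage)

# PCA: centre the columns and keep the first two principal components
pca <- prcomp(usage, center = TRUE, scale. = FALSE)
pca_xy <- pca$x[, 1:2]

# Classical MDS: embed the samples in 2-D from their pairwise Euclidean distances
mds_xy <- cmdscale(dist(usage), k = 2)
```

With Euclidean distances, classical MDS reproduces the PCA coordinates up to the sign of each axis; the two methods diverge once a different dissimilarity (for example Jensen-Shannon divergence) is fed to MDS.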
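3. Clustering methods
"hclust" (hierarchical clustering). Function: builds a dendrogram-based hierarchy of clusters and can be combined with different distance measures (for example Euclidean distance). Application: grouping samples and identifying immune subgroups that share clonal expansion patterns.
"kmeans" (K-means). Function: partitions samples into K clusters; the number of clusters must be specified in advance, and the method works best on roughly spherical clusters. Application: quick classification of repertoire data (for example high versus low vaccine responders).
"dbscan" (DBSCAN). Function: density-based clustering that flags noise points automatically and adapts to irregularly shaped clusters. Application: detecting low-frequency but biologically relevant clonotypes (for example rare clones in autoimmune disease).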
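As a rough illustration of the first two clustering methods, the base R sketch below clusters six hypothetical 2-D points that form two obvious groups. The coordinates are invented, and the calls are plain stats functions, not immunarch's wrappers. DBSCAN is omitted because it is not part of base R.

```r
# Hypothetical 2-D coordinates for 6 samples forming two well-separated groups
xy <- rbind(
  c(0.0, 0.1), c(0.1, 0.0), c(0.1, 0.1),
  c(5.0, 5.1), c(5.1, 5.0), c(5.1, 5.1)
)
rownames(xy) <- paste0("S", 1:6)

# Hierarchical clustering: build the tree from Euclidean distances,
# then cut it into k = 2 clusters
hc <- hclust(dist(xy), method = "complete")
hc_groups <- cutree(hc, k = 2)

# K-means: the number of clusters must be given up front;
# nstart = 10 reruns from several random starts and keeps the best solution
set.seed(1)
km <- kmeans(xy, centers = 2, nstart = 10)
km_groups <- km$cluster
```

On data this clean both methods recover the same two groups; they tend to disagree when clusters overlap or are elongated.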
Usage
geneUsageAnalysis(
.data,
.method = c("js+hclust", "pca+kmeans", "anova", "js+pca+kmeans"),
.base = 2,
.norm.entropy = FALSE,
.cor = c("pearson", "kendall", "spearman"),
.do.norm = TRUE,
.laplace = 1e-12,
.verbose = TRUE,
.k = 2,
.eps = 0.01,
.perp = 1,
.theta = 0.1
)
Arguments
.data
The geneUsageAnalysis function runs on the output from geneUsage.
.method
A string that defines the type of analysis to perform. Can be "pca", "mds", "js",
"kmeans", "hclust", "dbscan" or "cor" if you want to calculate correlation
coefficient. In the latter case you have to provide the .cor argument.
.base
A numerical value that defines the logarithm base for Jensen-Shannon divergence.
.norm.entropy
A logical value. Set TRUE to normalise your data if you haven't done it already.
.cor
A string that defines the correlation coefficient for analysis.
Can be "pearson", "kendall" or "spearman".
.do.norm
A logical value. If TRUE it forces Laplace smoothing,
if NA it checks if smoothing is necessary, if FALSE does nothing.
.laplace
The numeric value, which is used as a pseudocount for Laplace smoothing.
.verbose
A logical value.
.k
The number of clusters to create, passed as k to hcut or as centers to kmeans.
.eps
A numerical value, DBSCAN epsilon parameter, see immunr_dbscan.
.perp
A numerical value, t-SNE perplexity, see immunr_tsne.
.theta
A numerical value, t-SNE theta parameter, see immunr_tsne.
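The .do.norm and .laplace arguments refer to Laplace smoothing. The sketch below shows the general idea on hypothetical counts: a tiny pseudocount keeps every frequency strictly positive so that logarithms in divergence calculations stay finite. How immunarch applies the pseudocount internally is not shown in this article, so the renormalisation step here is an assumption.

```r
# Hypothetical raw gene usage counts; TRBV3 was not observed in this sample
counts  <- c(TRBV1 = 8, TRBV2 = 2, TRBV3 = 0)
laplace <- 1e-12   # same order of magnitude as the .laplace default

# Without smoothing, the zero frequency breaks entropy-style terms:
freq_raw <- counts / sum(counts)
raw_term <- freq_raw[["TRBV3"]] * log(freq_raw[["TRBV3"]])   # 0 * -Inf gives NaN

# With smoothing (assumed form): add the pseudocount, then renormalise
freq_smooth  <- (counts + laplace) / sum(counts + laplace)
smooth_terms <- freq_smooth * log(freq_smooth)               # all finite
```

Because the pseudocount is so small, the observed frequencies are essentially unchanged.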
vis(geneUsageAnalysis(imm_gu, "cor+pca+kmeans", .verbose = FALSE), .plot = "clust")
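In the code above, the method string "cor+pca+kmeans" follows the pattern A+B+C:
A = similarity measure (js/cor/cosine)
B = dimensionality reduction method (pca/tsne/mds)
C = clustering method (kmeans/hclust/dbscan)
Besides A+B+C, the combinations A+C and B+C are also possible.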
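The method strings can be generated rather than typed by hand. The base R sketch below enumerates every A+B+C, A+C and B+C string from the three lists above. It only builds the strings; whether each one is accepted by geneUsageAnalysis should be checked against the package, since this article demonstrates only a subset of them.

```r
similarity <- c("js", "cor", "cosine")        # A
reduction  <- c("pca", "mds", "tsne")         # B
clustering <- c("kmeans", "hclust", "dbscan") # C

# Join every combination of the given steps with "+"
combine <- function(...) {
  grid <- expand.grid(..., stringsAsFactors = FALSE)
  do.call(paste, c(grid, sep = "+"))
}

abc <- combine(similarity, reduction, clustering)  # e.g. "cor+pca+kmeans"
ac  <- combine(similarity, clustering)             # e.g. "js+hclust"
bc  <- combine(reduction, clustering)              # e.g. "pca+kmeans"

methods <- c(abc, ac, bc)
length(methods)  # 27 + 9 + 9 = 45
```

Each element of methods could then be passed as the .method argument inside a loop.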
Some of the results are shown in the figure below.
vis(geneUsageAnalysis(imm_gu, .method = "x1", .verbose = FALSE),
.title = "TRBV usage x1", .leg.title = "x1", .text.size = 2)
Here "x1" is a placeholder for one of js, cor, cosine or pca.
The code for the clustering combinations is as follows:
vis(geneUsageAnalysis(imm_gu, "cor+mds+kmeans", .verbose = FALSE), .plot = "clust")
vis(geneUsageAnalysis(imm_gu, "cor+mds+hclust", .verbose = FALSE), .plot = "clust")
vis(geneUsageAnalysis(imm_gu, "cor+mds+dbscan", .verbose = FALSE), .plot = "clust")
vis(geneUsageAnalysis(imm_gu, "cor+tsne+kmeans", .verbose = FALSE), .plot = "clust")
vis(geneUsageAnalysis(imm_gu, "cor+tsne+hclust", .verbose = FALSE), .plot = "clust")
vis(geneUsageAnalysis(imm_gu, "cor+tsne+dbscan", .verbose = FALSE), .plot = "clust")
## Replace cor with js
vis(geneUsageAnalysis(imm_gu, "js+pca+kmeans", .verbose = FALSE), .plot = "clust")
vis(geneUsageAnalysis(imm_gu, "js+pca+hclust", .verbose = FALSE), .plot = "clust")
vis(geneUsageAnalysis(imm_gu, "js+pca+dbscan", .verbose = FALSE), .plot = "clust")
vis(geneUsageAnalysis(imm_gu, "js+mds+kmeans", .verbose = FALSE), .plot = "clust")
vis(geneUsageAnalysis(imm_gu, "js+mds+hclust", .verbose = FALSE), .plot = "clust")
vis(geneUsageAnalysis(imm_gu, "js+mds+dbscan", .verbose = FALSE), .plot = "clust")
vis(geneUsageAnalysis(imm_gu, "js+tsne+kmeans", .verbose = FALSE), .plot = "clust")
vis(geneUsageAnalysis(imm_gu, "js+tsne+hclust", .verbose = FALSE), .plot = "clust")
vis(geneUsageAnalysis(imm_gu, "js+tsne+dbscan", .verbose = FALSE), .plot = "clust")
vis(geneUsageAnalysis(imm_gu, "js+kmeans", .verbose = FALSE), .plot = "clust")
vis(geneUsageAnalysis(imm_gu, "js+hclust", .verbose = FALSE), .plot = "clust")
vis(geneUsageAnalysis(imm_gu, "js+dbscan", .verbose = FALSE), .plot = "clust")
vis(geneUsageAnalysis(imm_gu, "pca+kmeans", .verbose = FALSE), .plot = "clust")
vis(geneUsageAnalysis(imm_gu, "pca+hclust", .verbose = FALSE), .plot = "clust")
vis(geneUsageAnalysis(imm_gu, "pca+dbscan", .verbose = FALSE), .plot = "clust")
vis(geneUsageAnalysis(imm_gu, "cosine+kmeans", .verbose = FALSE), .plot = "clust")
vis(geneUsageAnalysis(imm_gu, "cosine+hclust", .verbose = FALSE), .plot = "clust")
vis(geneUsageAnalysis(imm_gu, "cosine+dbscan", .verbose = FALSE), .plot = "clust")
First type of clustering: compute a similarity measure, then cluster directly.
vis(geneUsageAnalysis(imm_gu, "cosine+kmeans", .verbose = FALSE), .plot = "clust")
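Second type of clustering: first compute a similarity or dissimilarity measure, then reduce dimensionality, and cluster on the reduced coordinates. This is also the most common and most reasonable approach.
vis(geneUsageAnalysis(imm_gu, "js+pca+kmeans", .verbose = FALSE), .plot = "clust")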
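To show what a three-step "js+pca+kmeans" chain does conceptually, here is a base R sketch on a hypothetical usage matrix with two groups of samples. It is an illustration under stated assumptions, not immunarch's code: in particular, running PCA directly on the pairwise divergence matrix is one plausible reading of "similarity, then reduction, then clustering", and the data and the js_div helper are invented.

```r
set.seed(42)

# Hypothetical usage matrix: 6 samples x 4 genes, two groups with
# different dominant genes plus a little noise; rows renormalised to sum to 1
group_a <- c(0.6, 0.2, 0.1, 0.1)
group_b <- c(0.1, 0.1, 0.2, 0.6)
usage <- rbind(
  t(replicate(3, group_a + runif(4, 0, 0.02))),
  t(replicate(3, group_b + runif(4, 0, 0.02)))
)
usage <- usage / rowSums(usage)
rownames(usage) <- paste0("S", 1:6)

# Step A: pairwise Jensen-Shannon divergence between samples
js_div <- function(a, b, base = 2) {
  m <- (a + b) / 2
  kl <- function(x, y) sum(ifelse(x > 0, x * log(x / y, base = base), 0))
  0.5 * kl(a, m) + 0.5 * kl(b, m)
}
n <- nrow(usage)
js_mat <- matrix(0, n, n, dimnames = list(rownames(usage), rownames(usage)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) js_mat[i, j] <- js_div(usage[i, ], usage[j, ])
}

# Step B: PCA on the divergence matrix (each sample is described by its
# divergence to every sample); keep the first two components
pcs <- prcomp(js_mat)$x[, 1:2]

# Step C: k-means with k = 2 on the reduced coordinates
clusters <- kmeans(pcs, centers = 2, nstart = 10)$cluster
```

Clustering on a few components instead of the full matrix reduces noise, which is one reason the three-step chain is often preferred.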
The hclust result takes the following form:
vis(geneUsageAnalysis(imm_gu, "js+mds+hclust", .verbose = FALSE), .plot = "clust")