
11-Tumor exome 1.1-GATK Best Practices: getting started

北野茶缸子
Published 2022-07-07 14:46:59
  • Date : [[2022-06-03_Fri]]
  • Tags : #bioinformatics/exome/practice #bioinformatics/exome/gatk

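Foreword

GATK, the Genome Analysis Toolkit, is widely regarded as an industry standard for identifying SNPs, indels, and CNVs in tumors.

Personally, I think that just as the three R packages DESeq2, limma, and edgeR are unavoidable for differential expression analysis in transcriptomics, starting tumor exome analysis with GATK means standing on the shoulders of giants.

That said, the tumor exome papers we read do not necessarily use GATK tools. For example: Multi-region sequencing unveils novel actionable targets and spatial heterogeneity in esophageal squamous cell carcinoma | Nature Communications[1]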

Somatic indels (insertions or deletions) were called by Pindel with supported reads ≥6, coverage ≥20, and VAF ≥ 0.1

Another example is Clonal architecture in mesothelioma is prognostic and shapes the tumour microenvironment | Nature Communications[2]

Somatic SNVs and INDELs were detected with VarScan2 and MuTect2. Briefly, VarScan2 somatic (v2.3) were used to do somatic variants calling between tumour and matched normal samples based on the output from SAMtools mpileup (1.0) All detected variants were annotated with Annovar

And also Whole-exome sequencing identifies somatic mutations associated with lung cancer metastasis to the brain - Liu - Annals of Translational Medicine (amegroups.com)[3]

sorted and removed polymerase chain reaction (PCR) duplication using GATK 4.1.2.0 Somatic mutation calling was performed using Mutect1, Mutect2 (18[4]), and VarDict (19[5]). Somatic mutations existing in at least 2 of the results of the 3 software were selected as high confident mutations and to be involved in the further bioinformatics and bio-statistical analysis. Copy number variants (CNVs) from WES data were detected by CNVKIT (Genome-Wide Copy Number Detection and Visualization from Targeted DNA Sequencing)

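Although these papers do not rely on GATK exclusively, Mutect2 shows up repeatedly, and since it is a GATK module, that alone speaks to the toolkit's influence.

Installing GATK is easy these days. You can get it from the official site: Getting started with GATK4 – GATK[6]

Or install the latest version directly with conda: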

conda install -y gatk4

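1-What the GATK Best Practices leave out

One example is how to obtain sequencing data from public repositories, using tools such as fastq-dump and prefetch, or ascp. Another handy tool is kingfisher: kingfisher 公共测序数据 SRA/Fastq 下载神器!- 知乎[7]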
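As a rough sketch of the download step with the two SRA Toolkit commands named above: the helper name `fetch_run`, the `DRY_RUN` switch, and the directory layout are assumptions made for illustration, and the path to the `.sra` file assumes the layout that recent `prefetch` versions produce.

```shell
# Hypothetical helper: download one SRA run, then convert it to paired FASTQ.
# $1 = run accession, $2 = output directory.
# Set DRY_RUN=1 to print the commands instead of running them.
fetch_run() {
    ${DRY_RUN:+echo} prefetch --output-directory "$2" "$1" &&
    ${DRY_RUN:+echo} fastq-dump --split-files --gzip --outdir "$2" "$2/$1/$1.sra"
}
```

Calling `DRY_RUN=1 fetch_run SRRXXXXXXX raw` (with a real accession in place of the placeholder) shows the two commands without touching the network.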

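GATK also gives no recommendations on quality control of the raw sequencing data.

In practice, the data still need QC before alignment. Poor-quality reads affect every downstream analysis, whether that is variant discovery or expression quantification.

For example, this paper, The Genomic and Immune Landscapes of Lethal Metastatic Breast Cancer: Cell Reports, mentions: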

Adaptor and low-quality base (Phred score below 20) trimming

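We can filter the reads with a trimming tool, then run a QC tool on both the raw and the filtered reads to compare how the quality metrics changed.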
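The text does not name specific tools, so the sketch below assumes fastp for trimming and FastQC for the before/after reports. The helper name `qc_sample`, the `_1`/`_2` file naming, and the `DRY_RUN` switch are hypothetical. The Phred 20 cutoff mirrors the quoted methods, but fastp applies it as a sliding-window mean from the 3' end, which may differ from what that paper did.

```shell
# Hypothetical helper: trim one paired-end sample, then QC raw and trimmed reads.
# $1 = input prefix  (expects ${1}_1.fastq.gz and ${1}_2.fastq.gz)
# $2 = output prefix for the trimmed reads and fastp reports
# $3 = existing directory for the FastQC reports
# Set DRY_RUN=1 to print the commands instead of running them.
qc_sample() {
    ${DRY_RUN:+echo} fastp \
        --in1 "${1}_1.fastq.gz" --in2 "${1}_2.fastq.gz" \
        --out1 "${2}_1.fastq.gz" --out2 "${2}_2.fastq.gz" \
        --detect_adapter_for_pe \
        --cut_tail --cut_mean_quality 20 \
        --json "${2}.fastp.json" --html "${2}.fastp.html" &&
    ${DRY_RUN:+echo} fastqc --outdir "$3" \
        "${1}_1.fastq.gz" "${1}_2.fastq.gz" \
        "${2}_1.fastq.gz" "${2}_2.fastq.gz"
}
```

The FastQC reports for the raw and trimmed files can then be compared side by side.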

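2-Processing aligned data

Reference: Data pre-processing for variant discovery – GATK[8]

A picture is worth a thousand words:

Notice that GATK's workflow includes a uBAM (unaligned BAM) data type, which is an interesting approach in its own right: 什么是ubam文件,为什么ubam文件比fastq文件好 - Zhongxu blog (zxzyl.com)[9]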
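For readers who want to try the uBAM route, a minimal sketch using the Picard FastqToSam tool bundled with GATK4 might look like this. The helper name `fastq_to_ubam` and the `DRY_RUN` switch are hypothetical, and reusing the sample name as the read group name and setting the platform to ILLUMINA are assumptions that should be adjusted to the actual data.

```shell
# Hypothetical helper: pack paired FASTQ files into an unaligned BAM (uBAM).
# $1 = read 1 FASTQ, $2 = read 2 FASTQ, $3 = sample name, $4 = output uBAM.
# Set DRY_RUN=1 to print the command instead of running it.
fastq_to_ubam() {
    ${DRY_RUN:+echo} gatk FastqToSam \
        --FASTQ "$1" --FASTQ2 "$2" \
        --OUTPUT "$4" \
        --SAMPLE_NAME "$3" --READ_GROUP_NAME "$3" --PLATFORM ILLUMINA
}
```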

The pre-processing itself mainly consists of marking duplicates and base quality score recalibration (BQSR).
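These two steps can be sketched with three GATK4 tools. The helper name `preprocess_bam`, the output file names, and the `DRY_RUN` switch are hypothetical; the input is assumed to be a coordinate-sorted BAM with read groups, and the known-sites VCF (for example dbSNP) must match the reference build.

```shell
# Hypothetical helper: mark duplicates, then recalibrate base quality scores.
# $1 = sorted input BAM, $2 = reference FASTA,
# $3 = known-sites VCF, $4 = output prefix.
# Set DRY_RUN=1 to print the commands instead of running them.
preprocess_bam() {
    ${DRY_RUN:+echo} gatk MarkDuplicates \
        -I "$1" -O "${4}.markdup.bam" -M "${4}.markdup.metrics.txt" &&
    ${DRY_RUN:+echo} gatk BaseRecalibrator \
        -R "$2" -I "${4}.markdup.bam" --known-sites "$3" -O "${4}.recal.table" &&
    ${DRY_RUN:+echo} gatk ApplyBQSR \
        -R "$2" -I "${4}.markdup.bam" --bqsr-recal-file "${4}.recal.table" \
        -O "${4}.bqsr.bam"
}
```

For exome data it is common to also restrict BaseRecalibrator to the capture targets with `-L`; that is left out here to keep the sketch short.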

Interestingly, not every study performs post-alignment steps such as duplicate removal.

3-somatic mutation calling

Reference: (How to) Call somatic mutations using GATK4 Mutect2 – GATK[10]

Somatic short variant discovery (SNVs + Indels) – GATK[11]

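Somatic mutations in a tumor can be loosely understood as new mutations that an individual's own cells acquire as they divide, as opposed to inherited germline variants. For this reason, tumor exome studies usually sequence a matched normal control alongside the tumor and analyze the two as a pair, in order to uncover the mutations that are new to the tumor cells.

ps: tumor-only analysis is also possible, but it tends to produce too many false positives.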
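A paired tumor-normal call with Mutect2 can be sketched as follows. The helper name `call_somatic`, the output naming, and the `DRY_RUN` switch are hypothetical; `-normal` takes the sample name from the normal BAM's read group (SM tag), not a file path, and the germline resource is assumed to be a population allele-frequency VCF such as the gnomAD file distributed with the GATK resource bundle.

```shell
# Hypothetical helper: call somatic variants on a tumor/normal pair.
# $1 = reference FASTA, $2 = tumor BAM, $3 = normal BAM,
# $4 = normal sample name (SM tag), $5 = germline resource VCF,
# $6 = output prefix.
# Set DRY_RUN=1 to print the command instead of running it.
call_somatic() {
    ${DRY_RUN:+echo} gatk Mutect2 \
        -R "$1" -I "$2" -I "$3" -normal "$4" \
        --germline-resource "$5" \
        -O "${6}.unfiltered.vcf.gz"
}
```

Dropping the normal BAM and the `-normal` argument gives the tumor-only mode mentioned above.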

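In practice, many papers do not go this far, for example by accounting for contamination between samples, or by using germline variants as a filter during calling itself.

You can see this in the methods of the paper quoted earlier as well: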

Somatic mutation calling was performed using Mutect1, Mutect2 (18[12]), and VarDict (19[13]). Somatic mutations existing in at least 2 of the results of the 3 software were selected as high confident mutations and to be involved in the further bioinformatics and bio-statistical analysis.

They simply ran Mutect2 and the other callers and kept the mutations that at least two of the three agreed on.
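The "at least 2 of 3 callers" rule can be sketched with standard text tools. This is only an illustration of the voting idea, not the pipeline used in that paper: the helper name `consensus_keys` and the `chrom:pos:ref:alt` key format are assumptions, and each input file is assumed to hold one whitespace-free key per line.

```shell
# Hypothetical helper: print variant keys that appear in at least two of the
# given key files (one file per caller, one key per line, no whitespace).
consensus_keys() {
    for f in "$@"; do
        sort -u "$f"          # count each caller at most once per variant
    done | sort | uniq -c | awk '$1 >= 2 { print $2 }'
}
```

One way to produce such key files from a VCF is `bcftools query -f '%CHROM:%POS:%REF:%ALT\n'`. Different callers may represent the same indel differently, so normalizing the VCFs first is advisable.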

So what do all these extra GATK modules actually contribute?

Take FilterMutectCalls, CalculateContamination, and so on.
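To make their roles concrete: CalculateContamination estimates what fraction of reads comes from cross-sample contamination, and FilterMutectCalls uses that estimate, among other evidence, to flag likely false positives. A minimal sketch of how they chain together is below. The helper name `filter_calls`, the file naming, and the `DRY_RUN` switch are hypothetical, and it assumes an unfiltered Mutect2 VCF named `<prefix>.unfiltered.vcf.gz` plus a VCF of common biallelic variants with allele frequencies.

```shell
# Hypothetical helper: estimate contamination, then filter Mutect2 calls.
# $1 = reference FASTA, $2 = tumor BAM, $3 = normal BAM,
# $4 = common-variant VCF (biallelic sites with AF), $5 = file prefix.
# Expects ${5}.unfiltered.vcf.gz to exist already.
# Set DRY_RUN=1 to print the commands instead of running them.
filter_calls() {
    ${DRY_RUN:+echo} gatk GetPileupSummaries \
        -I "$2" -V "$4" -L "$4" -O "${5}.tumor.pileups.table" &&
    ${DRY_RUN:+echo} gatk GetPileupSummaries \
        -I "$3" -V "$4" -L "$4" -O "${5}.normal.pileups.table" &&
    ${DRY_RUN:+echo} gatk CalculateContamination \
        -I "${5}.tumor.pileups.table" -matched "${5}.normal.pileups.table" \
        -O "${5}.contamination.table" &&
    ${DRY_RUN:+echo} gatk FilterMutectCalls \
        -R "$1" -V "${5}.unfiltered.vcf.gz" \
        --contamination-table "${5}.contamination.table" \
        -O "${5}.filtered.vcf.gz"
}
```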

The resulting high-frequency mutations can be summarized with a waterfall plot like this one:

Source: Whole-exome sequencing identifies somatic mutations associated with lung cancer metastasis to the brain - Liu - Annals of Translational Medicine[14]

4-copy number variant

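Reference: Somatic copy number variant discovery (CNVs) – GATK[15]

A CNV is a gain or loss in copy number affecting a genomic region larger than 1 kb, and CNVs likewise play an important role in the onset and progression of cancer.

As you can see, copy number analysis also starts from BAM files. In my own earlier experience, though, I used CNVkit directly to compute the regions with copy number changes, and then GISTIC to identify the loci where copy number changes are significant.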
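A rough sketch of the CNVkit half of that route, for one tumor-normal pair, is below. The helper name `cnvkit_seg`, the directory layout, and the `DRY_RUN` switch are hypothetical, and the `.cns` file name assumes CNVkit names its outputs after the BAM file's base name.

```shell
# Hypothetical helper: run CNVkit on a tumor/normal pair and export a SEG file.
# $1 = tumor BAM, $2 = normal BAM, $3 = capture target BED,
# $4 = reference FASTA, $5 = output directory.
# Set DRY_RUN=1 to print the commands instead of running them.
cnvkit_seg() {
    sample=$(basename "$1" .bam)
    ${DRY_RUN:+echo} cnvkit.py batch "$1" --normal "$2" \
        --targets "$3" --fasta "$4" --output-dir "$5" &&
    ${DRY_RUN:+echo} cnvkit.py export seg "${5}/${sample}.cns" \
        -o "${5}/${sample}.seg"
}
```

The SEG files from all samples would then be combined and passed to GISTIC; that step is not shown because its invocation depends on how GISTIC is installed.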

The specific loci with significant copy number changes:

Source: Whole-exome sequencing identifies somatic mutations associated with lung cancer metastasis to the brain - Liu - Annals of Translational Medicine[16]

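Other parts

Beyond these there are also germline variant analysis, RNA-seq based variant analysis, and so on.

I will cover them in later posts.

Other learning resources

I happen to be learning Snakemake at the moment, and there are some GATK-based workflow projects: OVarFlow: a resource optimized GATK 4 based Open source Variant calling workFlow - PMC[17]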

ESR-NZ/human_genomics_pipeline: A Snakemake workflow to process single samples or cohorts of paired-end sequencing data (WGS or WES) using trim galore/bwa/GATK4/parabricks.[18]

ps: both of the above appear to cover germline variants only.

There is also the exome article series from 生信技能树: 肿瘤外显子数据分析指南 · 语雀[19]

癌症基因的somatic mutation calling 流程的评价体系 | 生信菜鸟团[20]

The official GATK Jupyter notebook tutorials: Notebooks[21]

The detailed algorithm behind Mutect: gatk/mutect.pdf at master · broadinstitute/gatk[22]

Systematic comparison of somatic variant calling performance among different sequencing depth and mutation frequency | Scientific Reports (nature.com)[23]

Some good papers: Multi-region sequencing unveils novel actionable targets and spatial heterogeneity in esophageal squamous cell carcinoma | Nature Communications[24]

Clonal architecture in mesothelioma is prognostic and shapes the tumour microenvironment - PMC[25]

Tracking the Evolution of Non-Small-Cell Lung Cancer - PubMed (nih.gov)[26]

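Afterword

Knowing how to run the software is, of course, not enough.

Take signature analysis like this, for example:

Copy number analysis:

And multi-region evolutionary analysis:

ps: the figures above all come from the paper Multi-region sequencing unveils novel actionable targets and spatial heterogeneity in esophageal squamous cell carcinoma | Nature Communications[27]

That takes reading a lot more papers.

I also still have plenty of open questions myself, so if you spot a problem, please message me with corrections. I will be grateful (●'◡'●). Let's keep learning together.

References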

[1]

Multi-region sequencing unveils novel actionable targets and spatial heterogeneity in esophageal squamous cell carcinoma | Nature Communications: https://www.nature.com/articles/s41467-019-09255-1#Sec10

[2]

Clonal architecture in mesothelioma is prognostic and shapes the tumour microenvironment | Nature Communications: https://www.nature.com/articles/s41467-021-21798-w

[3]

Whole-exome sequencing identifies somatic mutations associated with lung cancer metastasis to the brain - Liu - Annals of Translational Medicine (amegroups.com): https://atm.amegroups.com/article/view/68581/html#methods

[4]

18: https://atm.amegroups.com/article/view/68581/html#B18

[5]

19: https://atm.amegroups.com/article/view/68581/html#B19

[6]

Getting started with GATK4 – GATK: https://gatk.broadinstitute.org/hc/en-us/articles/360036194592-Getting-started-with-GATK4

[7]

kingfisher 公共测序数据 SRA/Fastq 下载神器!- 知乎: https://zhuanlan.zhihu.com/p/399312134

[8]

Data pre-processing for variant discovery – GATK: https://gatk.broadinstitute.org/hc/en-us/articles/360035535912-Data-pre-processing-for-variant-discovery

[9]

什么是ubam文件,为什么ubam文件比fastq文件好 - Zhongxu blog (zxzyl.com): https://www.zxzyl.com/archives/1016/

[10]

(How to) Call somatic mutations using GATK4 Mutect2 – GATK: https://gatk.broadinstitute.org/hc/en-us/articles/360035531132

[11]

Somatic short variant discovery (SNVs + Indels) – GATK: https://gatk.broadinstitute.org/hc/en-us/articles/360035894731-Somatic-short-variant-discovery-SNVs-Indels-

[12]

18: https://atm.amegroups.com/article/view/68581/html#B18

[13]

19: https://atm.amegroups.com/article/view/68581/html#B19

[14]

Whole-exome sequencing identifies somatic mutations associated with lung cancer metastasis to the brain - Liu - Annals of Translational Medicine: https://atm.amegroups.com/article/view/68581/html

[15]

Somatic copy number variant discovery (CNVs) – GATK: https://gatk.broadinstitute.org/hc/en-us/articles/360035535892-Somatic-copy-number-variant-discovery-CNVs-

[16]

Whole-exome sequencing identifies somatic mutations associated with lung cancer metastasis to the brain - Liu - Annals of Translational Medicine: https://atm.amegroups.com/article/view/68581/html

[17]

OVarFlow: a resource optimized GATK 4 based Open source Variant calling workFlow - PMC: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8361789/

[18]

ESR-NZ/human_genomics_pipeline: A Snakemake workflow to process single samples or cohorts of paired-end sequencing data (WGS or WES) using trim galore/bwa/GATK4/parabricks.: https://github.com/ESR-NZ/human_genomics_pipeline

[19]

肿瘤外显子数据分析指南 · 语雀: https://www.yuque.com/biotrainee/wes

[20]

癌症基因的somatic mutation calling 流程的评价体系 | 生信菜鸟团: http://www.bio-info-trainee.com/2529.html

[21]

Notebooks: https://notebooks.githubusercontent.com/view/ipynb?browser=chrome&color_mode=auto&commit=e1946f5f8f025a416872d27bbb4fe86b9c68179e&device=unknown&enc_url=68747470733a2f2f7261772e67697468756275736572636f6e74656e742e636f6d2f6761746b2d776f726b666c6f77732f6761746b342d6a7570797465722d6e6f7465626f6f6b2d7475746f7269616c732f653139343666356638663032356134313638373264323762626234666538366239633638313739652f6e6f7465626f6f6b732f446179332d536f6d617469632f312d736f6d617469632d6d7574656374322d7475746f7269616c2e6970796e62&logged_in=false&nwo=gatk-workflows%2Fgatk4-jupyter-notebook-tutorials&path=notebooks%2FDay3-Somatic%2F1-somatic-mutect2-tutorial.ipynb&platform=android&repository_id=183114910&repository_type=Repository&version=98

[22]

gatk/mutect.pdf at master · broadinstitute/gatk: https://github.com/broadinstitute/gatk/blob/master/docs/mutect/mutect.pdf

[23]

Systematic comparison of somatic variant calling performance among different sequencing depth and mutation frequency | Scientific Reports (nature.com): https://www.nature.com/articles/s41598-020-60559-5#:~:text=A%20large%20number%20of%20tools%20are%20able%20to,one%20of%20the%20most%20widely%20used%20mutation-calling%20tools.

[24]

Multi-region sequencing unveils novel actionable targets and spatial heterogeneity in esophageal squamous cell carcinoma | Nature Communications: https://www.nature.com/articles/s41467-019-09255-1#Sec10

[25]

Clonal architecture in mesothelioma is prognostic and shapes the tumour microenvironment - PMC: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7979861/

[26]

Tracking the Evolution of Non-Small-Cell Lung Cancer - PubMed (nih.gov): https://pubmed.ncbi.nlm.nih.gov/28445112/

[27]

Multi-region sequencing unveils novel actionable targets and spatial heterogeneity in esophageal squamous cell carcinoma | Nature Communications: https://www.nature.com/articles/s41467-019-09255-1#Sec10

Originally published: 2022-06-08

Shared from the 北野茶缸子 WeChat official account.
