
Hands-On with an Excellent ATAC-seq Data Analysis Resource (Part 2)

生信技能树 · 2025-02-28

Previously we introduced two excellent reviews of ATAC-seq data-analysis pipelines: "A Review of ATAC-seq Data Analysis Tools" and "Omni-ATAC: an updated and optimized ATAC-seq protocol (Nat Protoc)". Today we put them into practice!

The resource covered this time is listed below. Its content was compiled from a variety of internet sources (forums, papers, workshops, etc.); the author could not list every source, but wanted to thank everyone who shared their experience, knowledge, and code:

URL: https://yiweiniu.github.io/blog/2019/03/ATAC-seq-data-analysis-from-FASTQ-to-peaks/

This post continues from the previous one: Hands-On with an Excellent ATAC-seq Data Analysis Resource (Part 1).

Last time we got as far as removing PCR-duplicate reads with Picard's MarkDuplicates, where the following errors occurred:

java -XX:ParallelGCThreads=30 -Djava.io.tmpdir=/tmp -jar /nas2/zhangj/biosoft/picard/picard.jar MarkDuplicates QUIET=true INPUT=SRR14205836.rmChrM.bam OUTPUT=SRR14205836.rmDup.bam METRICS_FILE=SRR14205836.sorted.metrics REMOVE_DUPLICATES=true CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT TMP_DIR=/tmp
java -XX:ParallelGCThreads=30 -Djava.io.tmpdir=/tmp -jar /nas2/zhangj/biosoft/picard/picard.jar MarkDuplicates QUIET=true INPUT=SRR14205837.rmChrM.bam OUTPUT=SRR14205837.rmDup.bam METRICS_FILE=SRR14205837.sorted.metrics REMOVE_DUPLICATES=true CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT TMP_DIR=/tmp
INFO 2025-02-26 19:03:33 MarkDuplicates

********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
**********
https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
**********    MarkDuplicates -QUIET true -INPUT SRR14205836.rmChrM.bam -OUTPUT SRR14205836.rmDup.bam -METRICS_FILE SRR14205836.sorted.metrics -REMOVE_DUPLICATES true -CREATE_INDEX true -VALIDATION_STRINGENCY LENIENT -TMP_DIR /tmp
**********

19:03:33.717 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/nas2/zhangj/biosoft/picard/picard.jar!/com/intel/gkl/native/libgkl_compression.so
INFO 2025-02-26 19:03:33 MarkDuplicates Start of doWork freeMemory: 274413664; totalMemory: 285212672; maxMemory: 31675383808
INFO 2025-02-26 19:03:33 MarkDuplicates Reading input file and constructing read end information.
INFO 2025-02-26 19:03:33 MarkDuplicates Will retain up to 114765883 data points before spilling to disk.
WARNING 2025-02-26 19:03:34 AbstractOpticalDuplicateFinderCommandLineProgram A field field parsed out of a read name was expected to contain an integer and did not. Read name: SRR14205836.1843967. Cause: String 'SRR14205836.1843967' did not start with a parsable number.

Exception in thread "main" java.lang.NullPointerException: Cannot invoke "htsjdk.samtools.SAMReadGroupRecord.getReadGroupId()" because the return value of "htsjdk.samtools.SAMRecord.getReadGroup()" is null
 at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:558)
 at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:270)
 at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:281)
 at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:105)
 at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:115)
The first error:

This notice says that Picard's command-line syntax has changed; see https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition) for details.

In the new syntax, parameters are passed as --PARAM value instead of PARAM=value.

The second error: https://github.com/broadinstitute/gatk/issues/8904

The NullPointerException occurs because the BAM file carries no @RG read-group record, so MarkDuplicates receives null from SAMRecord.getReadGroup(). The fix is to add a read group with samtools addreplacerg:
samtools view -H SRR14205836.rmChrM.bam | grep '@RG' 

# @RG\tID:AB318-005\tLB:AB318-005\tPL:ILLUMINA\tSM:AB318-005
# @RG\tID:ReadGroup1\tSM:SampleName\tPL:Illumina\tLB:Library.fa
# You can set ReadGroup1 and SampleName and Library.fa to something really existing in your dataset (lane number, real sample name and the name of the fasta file)
# Run the following two commands
samtools addreplacerg -r "@RG\tID:SRR14205836\tSM:SRR14205836\tPL:Illumina\tLB:SRR14205836" -o SRR14205836.rmChrM-1.bam  SRR14205836.rmChrM.bam
samtools addreplacerg -r "@RG\tID:SRR14205837\tSM:SRR14205837\tPL:Illumina\tLB:SRR14205837" -o SRR14205837.rmChrM-1.bam  SRR14205837.rmChrM.bam
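
To confirm the read group was written, the new header should now contain the @RG line (a quick sanity check, same file names as above):

samtools view -H SRR14205836.rmChrM-1.bam | grep '^@RG'
# expected output along the lines of: @RG  ID:SRR14205836  SM:SRR14205836  PL:Illumina  LB:SRR14205836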

The updated code:

# Generate mkDup.sh
# Note: duplicates are only marked here (output *.mkDup.bam), not removed;
# add --REMOVE_DUPLICATES true to remove them instead.
cat ../data/cleandata/ID | while read id
do
  echo "java -XX:ParallelGCThreads=30 -Xmx50g -Djava.io.tmpdir=/tmp -jar /nas2/zhangj/biosoft/picard/picard.jar MarkDuplicates --QUIET  true -I ${id}.rmChrM-1.bam -O ${id}.mkDup.bam --METRICS_FILE ${id}.sorted.metrics --CREATE_INDEX true --VALIDATION_STRINGENCY LENIENT --TMP_DIR /tmp --ASSUME_SORT_ORDER coordinate"
done >mkDup.sh

# Run mkDup.sh, submitting the jobs in parallel with ParaFly
nohup ParaFly -c mkDup.sh -CPU 2 1>mkDup.log  2>&1 &

Contents of mkDup.sh:

java -XX:ParallelGCThreads=30 -Xmx50g -Djava.io.tmpdir=/tmp -jar /nas2/zhangj/biosoft/picard/picard.jar MarkDuplicates --QUIET true -I SRR14205836.rmChrM-1.bam -O SRR14205836.mkDup.bam --METRICS_FILE SRR14205836.sorted.metrics --CREATE_INDEX true --VALIDATION_STRINGENCY LENIENT --TMP_DIR /tmp --ASSUME_SORT_ORDER coordinate
java -XX:ParallelGCThreads=30 -Xmx50g -Djava.io.tmpdir=/tmp -jar /nas2/zhangj/biosoft/picard/picard.jar MarkDuplicates --QUIET true -I SRR14205837.rmChrM-1.bam -O SRR14205837.mkDup.bam --METRICS_FILE SRR14205837.sorted.metrics --CREATE_INDEX true --VALIDATION_STRINGENCY LENIENT --TMP_DIR /tmp --ASSUME_SORT_ORDER coordinate
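
Once the jobs finish, the duplication rate can be read from the metrics file; a quick sketch, assuming Picard's standard DuplicationMetrics layout (a "## METRICS CLASS" line followed by a header row and a data row containing PERCENT_DUPLICATION):

# Print the metrics header row and the data row beneath it
grep -A 2 '^## METRICS CLASS' SRR14205836.sorted.metrics | column -t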
Non-unique alignments

Ref: Harvard FAS Informatics - ATAC-seq Guidelines

Some researchers choose to remove non-uniquely aligned reads; this can be done with the -q option of samtools view.

Different aligners implement mapping quality (MAPQ) differently. See the post "More madness with MAPQ scores (a.k.a. why bioinformaticians hate poor and incomplete software documentation)" (https://www.acgt.me/?offset=1426809676847).

So when using MAPQ to filter non-unique alignments, always check how your particular aligner assigns MAPQ values.

For Bowtie2, MAPQ > 30 can be used. Example code below (not run here):

# Remove multi-mapped reads (i.e. those with MAPQ < 30, using -q in SAMtools)
samtools view -h -q 30 ${sample}.bam > ${sample}.rmMulti.bam
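
Because MAPQ conventions differ between aligners, it can help to look at the MAPQ distribution your aligner actually produced before choosing a cutoff; a quick sketch (${sample}.bam as above):

# Tally reads per MAPQ value (column 5 of the SAM record)
samtools view ${sample}.bam | awk '{count[$5]++} END{for (q in count) print q, count[q]}' | sort -n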
Others

In the ENCODE pipeline and in some papers, the following types of reads are also removed (via the samtools view -F flag, with a value of 1796 or 1804):

  • reads unmapped
  • mate unmapped
  • not primary alignment
  • reads failing platform/vendor quality checks
  • PCR or optical duplicates

The remaining reads are the so-called "properly mapped reads". Example code below (not run here):

# Remove reads unmapped, mate unmapped, not primary alignment, reads failing platform, duplicates (-F 1804)
# Retain properly paired reads -f 2
samtools view -h -b -F 1804 -f 2 ${sample}.bam > ${sample}.filtered.bam
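
If you want to double-check what a flag value such as 1804 covers, samtools can decode it (exact output format may vary by samtools version):

samtools flags 1804
# e.g.: 0x70c  1804  UNMAP,MUNMAP,SECONDARY,QCFAIL,DUP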
Merging BAMs (optional)

When several libraries were prepared for one experimental condition (i.e., the condition has multiple biological and/or technical replicates), you may want to merge the BAM files before calling peaks (for example, merging technical replicates, or merging BAMs to gain sequencing depth).

Considerations around "merging BAM files" versus "merging peaks" have been discussed in:

  • ChIP-Seq: Calling peaks with replicates:https://www.biostars.org/p/112778/
  • how to pool together biological replicates?:https://www.biostars.org/p/191474/
  • ChipSeq: merge bam file before peak calling:https://www.biostars.org/p/210564/
  • Chip-Seq merging peak files:https://www.biostars.org/p/230055/

Example code for merging BAMs:

samtools merge -@ $PPN condition1.merged.bam sample1.bam sample2.bam sample3.bam
samtools index -@ $PPN condition1.merged.bam
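
A quick sanity check of the merge is to confirm that the merged read count equals the sum of the inputs:

# Read counts of the inputs should add up to the merged total
for b in sample1.bam sample2.bam sample3.bam condition1.merged.bam
do
    echo "$b: $(samtools view -c $b)"
done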

3.3 Shifting reads

In the original ATAC-seq paper (https://www.nature.com/articles/nmeth.2688), because the Tn5 transposase binds as a dimer and inserts two adapters separated by 9 bp, all reads aligning to the + strand were offset by +4 bp and all reads aligning to the − strand were offset by −5 bp.

Ref: shifting reads bam for NucleoATAC?

For peak calling, however, read shifting is probably not very important: it is a tiny adjustment, and peaks typically span hundreds of base pairs. Shifting becomes essential only when single-base resolution of the insertion site is needed, e.g., for transcription-factor (TF motif) footprinting.

Also keep in mind that not every TF footprinting tool needs shifted reads; some, such as NucleoATAC, perform the shift internally.

Example code for the shift:

Ref: Shifting reads for ATAC-seq alignments(https://www.biostars.org/p/187204/#187206)

###################################### Method 1: https://www.biostars.org/p/187204/#187206
# use bedtools and awk
# the BAM file should be sorted by read name beforehand
samtools sort -n -T aln.sorted -o aln.sorted.bam aln.bam

# bedtools extracts the paired-end alignments in BEDPE format, then awk shifts each fragment
bedtools bamtobed -bedpe -i aln.sorted.bam | awk -v OFS="\t" \
  '{if ($9=="+") {print $1,$2+4,$6+4} else if ($9=="-") {print $1,$2-5,$6-5}}' > fragments.bed
   
###################################### Method 2: https://deeptools.readthedocs.io/en/develop/content/tools/alignmentSieve.html
# use --ATACshift
alignmentSieve --numberOfProcessors 8 --ATACshift --bam sample1.bam -o sample1.tmp.bam

# the bam file needs to be sorted again
samtools sort -@ 8 -O bam -o sample1.shifted.bam sample1.tmp.bam
samtools index -@ 8 sample1.shifted.bam
rm sample1.tmp.bam

###################################### Method 3: https://bioconductor.org/packages/release/bioc/html/ATACseqQC.html
# use ATACseqQC
## load the library
library(ATACseqQC)

## input is bamFile
bamfile <- system.file("extdata", "GL1.bam", package="ATACseqQC", mustWork=TRUE)
bamfile.labels <- gsub(".bam", "", basename(bamfile))

## bamfile tags
tags <- c("AS", "XN", "XM", "XO", "XG", "NM", "MD", "YS", "YT")
## files will be output into outPath
outPath <- "splited"
dir.create(outPath)
## shift the bam file by the 5'ends
library(BSgenome.Hsapiens.UCSC.hg19)
seqlev <- "chr1" ## subsample data for quick run
which <- as(seqinfo(Hsapiens)[seqlev], "GRanges")
gal <- readBamFile(bamfile, tag=tags, which=which, asMates=TRUE)
gal1 <- shiftGAlignmentsList(gal)
shiftedBamfile <- file.path(outPath, "shifted.bam")
export(gal1, shiftedBamfile)

3.4 Peak calling: MACS2

Since paired-end sequencing is widely used for ATAC-seq, we tell MACS2 that the data are paired via the -f BAMPE option. That way MACS2 analyzes only properly paired reads (we are using the BAM obtained after the filtering above). Fragments are defined by the read pairs; no model building or artificial extension is involved.

cd 2022-GSE171832-28-embryonal-atacseq/PeakCalling/
# Activate the conda environment
conda activate rna
# Install MACS2
conda install macs2 -y

# -f BAMPE, use paired-end information
# --keep-dup all, keep all duplicate reads.
# Contents of macs2.sh:
macs2 callpeak -f BAMPE -g hs --keep-dup all --cutoff-analysis -n SRR14205836 -t ../Mapping/SRR14205836.mkDup.bam --outdir SRR14205836/
macs2 callpeak -f BAMPE -g hs --keep-dup all --cutoff-analysis -n SRR14205837 -t ../Mapping/SRR14205837.mkDup.bam --outdir SRR14205837/

# Run macs2.sh, submitting the jobs in parallel with ParaFly
nohup ParaFly -c macs2.sh -CPU 2 1>macs2.log  2>&1 &

The output after running:

├── SRR14205836
│   ├── SRR14205836_cutoff_analysis.txt
│   ├── SRR14205836_peaks.narrowPeak
│   ├── SRR14205836_peaks.xls
│   └── SRR14205836_summits.bed
└── SRR14205837
    ├── SRR14205837_cutoff_analysis.txt
    ├── SRR14205837_peaks.narrowPeak
    ├── SRR14205837_peaks.xls
    └── SRR14205837_summits.bed
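
A first look at the results is simply counting the peaks per sample (directory layout as above):

wc -l SRR*/SRR*_peaks.narrowPeak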

3.5 Generating browser tracks

Use bamCoverage from deepTools to create bigWig files for visualization; it offers several methods for normalizing the signal.

# Contents of bigWig.sh
# bam to bigwig, normalize using 1x effective genome size
# effective genome size: https://deeptools.readthedocs.io/en/latest/content/feature/effectiveGenomeSize.html
bamCoverage --numberOfProcessors 30 --binSize 10 --normalizeUsing RPGC --effectiveGenomeSize 2652783500 --bam ../Mapping/SRR14205836.mkDup.bam -o SRR14205836.bw
bamCoverage --numberOfProcessors 30 --binSize 10 --normalizeUsing RPGC --effectiveGenomeSize 2652783500 --bam ../Mapping/SRR14205837.mkDup.bam -o SRR14205837.bw

# Run bigWig.sh, submitting the jobs in parallel with ParaFly
nohup ParaFly -c bigWig.sh -CPU 2 1>bigWig.log  2>&1 &
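
Optionally, the two bigWig tracks can be compared with deepTools for a first impression of replicate concordance; a sketch (the default bin size and the Spearman method are arbitrary choices here):

# Summarize signal in genome-wide bins, then plot the correlation between replicates
multiBigwigSummary bins -b SRR14205836.bw SRR14205837.bw -o bw_summary.npz
plotCorrelation -in bw_summary.npz --corMethod spearman --whatToPlot scatterplot -o replicate_correlation.png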

3.6 Quality control

While processing ATAC-seq data, we can compute a number of metrics to assess the quality of the data/libraries.

References:

  • ENCODE - Terms and Definitions:https://www.encodeproject.org/data-standards/terms/#enrichment
  • ENCODE - ATAC-seq Data Standards and Prototype Processing Pipeline:https://www.encodeproject.org/atac-seq/

Below are the standards used by ENCODE; each term is explained in more detail below.

Current standards:

  • Experiments should have >= 2 biological replicates.
  • Each replicate needs at least 25M non-duplicate, non-mitochondrial aligned reads for single-end data, or 50M for paired-end data (i.e., 25M fragments, counting each read pair as one fragment).
  • Alignment rate: above 95%; above 80% is the minimum acceptable.
  • Replicate concordance: measured by the Irreproducible Discovery Rate (IDR); both the rescue ratio and the self-consistency ratio should be < 2.
  • Library complexity: NRF > 0.9, PBC1 > 0.9, and PBC2 > 3.
  • Standards for the peak files:
    • peaks per sample > 150,000 (minimum > 100,000)
    • IDR-passing peaks > 70,000 (minimum > 50,000)
    • A nucleosome-free region (NFR) must be present.
    • A mononucleosome peak must be present in the fragment-length distribution. These reads span a single nucleosome, so their length is greater than 147 bp but less than 147 bp × 2. A high-quality ATAC-seq dataset contains nucleosome-spanning reads (which makes it possible to call nucleosome positions in addition to identifying open chromatin).
  • FRiP score > 0.3 (minimum > 0.2). For ENCODE tissue samples the FRiP score is not enforced as a QC metric, but transcription start site (TSS) enrichment remains a key measure of signal-to-noise (a by-hand FRiP sketch appears further below).
  • The TSS enrichment cutoff depends on the reference annotation used; cutoff values for high-quality data are listed in a table on the ENCODE standards page (see the sketch after this list for computing a TSS profile).
  • Experiments must pass routine metadata audits before release.
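
As an illustration, here is a minimal sketch of generating a TSS-enrichment profile with deepTools, using the bigWig files from section 3.5; tss.bed is a hypothetical BED file of TSS positions for your annotation:

# Compute signal in a +/-2 kb window around each TSS, then plot the average profile
computeMatrix reference-point --referencePoint TSS \
  -S SRR14205836.bw SRR14205837.bw -R tss.bed \
  -b 2000 -a 2000 -o tss_matrix.gz
plotProfile -m tss_matrix.gz -out tss_profile.png

A strong enrichment of signal right at the TSS relative to the flanks is what the ENCODE cutoffs quantify.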

We will also write practical notes later discussing each of these points in detail.
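
In the meantime, here is a minimal by-hand FRiP sketch (fraction of reads falling in peaks), assuming the BAM and narrowPeak files produced in the earlier steps and the PeakCalling working directory:

# Reads overlapping peaks divided by total reads
total=$(samtools view -c ../Mapping/SRR14205836.mkDup.bam)
in_peaks=$(bedtools intersect -abam ../Mapping/SRR14205836.mkDup.bam \
  -b SRR14205836/SRR14205836_peaks.narrowPeak -u | samtools view -c -)
awk -v a=$in_peaks -v b=$total 'BEGIN{printf "FRiP = %.3f\n", a/b}'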

The original discussion is available at: https://yiweiniu.github.io/blog/2019/03/ATAC-seq-data-analysis-from-FASTQ-to-peaks/

3.7 Filtering peaks: blacklist regions

Peaks can be filtered against the DAC Blacklisted Regions: https://www.encodeproject.org/annotations/ENCSR636HFF/. The filtering code below comes from ENCODE - ATAC-seq Data Standards and Prototype Processing Pipeline (https://www.encodeproject.org/atac-seq/).

Go to the DAC Blacklisted Regions page above and download files.txt.

# Download; this yields the file ENCFF547MET.bed.gz
xargs -n 1 curl -O -L < files.txt

# Chromosome names in this file carry a chr prefix; our reference uses bare names, so strip it
zless -S ENCFF547MET.bed.gz  |sed 's/^chr//g' >ENCFF547MET.bed
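
Before intersecting, it is worth confirming that chromosome naming now matches between the peaks and the blacklist; a quick check (paths as in the peak-calling step):

# Both should show bare chromosome names (1, 2, ..., X, Y) without a chr prefix
head -n 3 ENCFF547MET.bed
cut -f1 SRR14205836/SRR14205836_peaks.narrowPeak | sort -u | head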

Filtering:

Note: grep -P '^[0-9XY]+(?!_)' keeps peaks whose chromosome name consists of digits or X/Y and is not followed by an underscore (i.e., it drops alt/random scaffolds); the pattern matches the bare chromosome names used here, with the chr prefix already removed.

# Sample SRR14205836
bedtools intersect -v -a SRR14205836/SRR14205836_peaks.narrowPeak -b ENCFF547MET.bed \
  | awk 'BEGIN{OFS="\t"} {if ($5>1000) $5=1000; print $0}' \
  | grep -P '^[0-9XY]+(?!_)' | gzip -nc > SRR14205836/SRR14205836_peaks.narrowPeak.filt.gz

# Sample SRR14205837
bedtools intersect -v -a SRR14205837/SRR14205837_peaks.narrowPeak -b ENCFF547MET.bed \
  | awk 'BEGIN{OFS="\t"} {if ($5>1000) $5=1000; print $0}' \
  | grep -P '^[0-9XY]+(?!_)' | gzip -nc > SRR14205837/SRR14205837_peaks.narrowPeak.filt.gz

Column 5 of a narrowPeak file is an integer score, int(-10*log10(qvalue)), restricted to the range [0, 1000]; the awk step caps values > 1000 at 1000.
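
To see how much the filter removed, compare the line counts before and after; note that the difference reflects both the blacklist intersection and the scaffold-dropping grep:

wc -l SRR14205836/SRR14205836_peaks.narrowPeak
zcat SRR14205836/SRR14205836_peaks.narrowPeak.filt.gz | wc -l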

3.8 Merging peaks (optional)

You may want to merge peaks from different libraries or samples. Corces et al., 2016 (https://www.nature.com/articles/ng.3646) merged them as follows:

To generate a non-redundant list of hematopoiesis- and cancer-related peaks, we first extended summits to 500-bp windows (±250 bp). We then ranked the 500-bp peaks by summit significance value (defined by MACS2) and chose a list of non-overlapping, maximally significant peaks.

This can be done with BEDOPS (ref: Collapsing multiple BED files into a master list by signal).

Example code:

#!/bin/bash

# modified from https://bedops.readthedocs.io/en/latest/content/usage-examples/master-list.html

summit_bed=(sample1_summits.bed sample2_summits.bed sample3_summits.bed)

out=fAdrenal.master.merge.bed

tmpd=/tmp/tmp$$
mkdir -p $tmpd

## First, union all the peaks together into a single file.
bedlist=""
for bed in ${summit_bed[*]}
do
    bedlist="$bedlist $bed"
done

# extended summits to 500-bp windows (±250 bp)
bedops --range 250 -u $bedlist > $tmpd/tmp.bed

## The master list is constructed iteratively.  For each pass through
## the loop, elements not yet in the master list are merged into
## non-overlapping intervals that span the union (this is just bedops
## -m).  Then for each merged interval, an original element of highest
## score within the interval is selected to go in the master list.
## Anything that overlaps the selected element is thrown out, and the
## process then repeats.
iters=1
solns=""
stop=0
while [ $stop == 0 ]
do
    echo "merge steps..."

    ## Condense the union into merged intervals. This klugey bit
    ## before and after the merging is because we don't want to merge
    ## regions that are simply adjacent but not overlapping
    bedops -m --range 0:-1 $tmpd/tmp.bed \
        | bedops -u --range 0:1 - \
        > $tmpd/tmpm.bed

    ## Grab the element with the highest score among all elements forming each interval.
    ## If multiple elements tie for the highest score, just grab one of them.
    ## Result is the current master list.  Probably don't need to sort, but do it anyway
    ## to be safe since we're not using --echo with bedmap call.
    bedmap --max-element $tmpd/tmpm.bed $tmpd/tmp.bed \
        | sort-bed - \
        > $tmpd/$iters.bed
    solns="$solns $tmpd/$iters.bed"
    echo "Adding `awk 'END { print NR }' $tmpd/$iters.bed` elements"

    ## Are there any elements that don't overlap the current master
    ## list?  If so, add those in, and repeat.  If not, we're done.
    bedops -n 1 $tmpd/tmp.bed $tmpd/$iters.bed \
       > $tmpd/tmp2.bed

    mv $tmpd/tmp2.bed $tmpd/tmp.bed

    if [ ! -s $tmpd/tmp.bed ]
    then
        stop=1
    fi

    ((iters++))
done

## final solution
bedops -u $solns \
   > $out

## Clean up
rm -r $tmpd

exit 0
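
A hypothetical invocation, assuming the script above is saved as merge_master_list.sh and the summit_bed array has been edited to point at your MACS2 *_summits.bed files:

bash merge_master_list.sh
head fAdrenal.master.merge.bed   # the non-redundant master peak list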

3.9 Going further

That completes the main upstream ATAC-seq analysis. Downstream, one may want to annotate the peaks, search for enriched transcription-factor motifs, compare peaks across conditions, and integrate the ATAC-seq data with other data types (such as RNA-seq).

(End)

We will follow up with hands-on discussions of the details in this post.

Originally published 2025-02-27 by the 生信技能树 WeChat public account.
