Previously we introduced two excellent reviews of ATAC-seq data analysis pipelines: 综述:ATAC-Seq 数据分析工具大全 and Omni-ATAC:更新和优化的ATAC-seq协议(Nat Protoc). Today we move on to the hands-on part!
The resource followed in this walkthrough is listed below. Its content was compiled from many sources on the internet (forums, papers, workshops, etc.); the author could not list every source, but wants to thank them all for sharing their experience, knowledge and code:
URL: https://yiweiniu.github.io/blog/2019/03/ATAC-seq-data-analysis-from-FASTQ-to-peaks/
This post continues the previous one: 一个优秀的ATAC-seq数据分析资源实战(一).
Last time we stopped at removing PCR-duplicate reads with Picard's MarkDuplicates, which failed with the error below:
java -XX:ParallelGCThreads=30 -Djava.io.tmpdir=/tmp -jar /nas2/zhangj/biosoft/picard/picard.jar MarkDuplicates QUIET=true INPUT=SRR14205836.rmChrM.bam OUTPUT=SRR14205836.rmDup.bam METRICS_FILE=SRR14205836.sorted.metrics REMOVE_DUPLICATES=true CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT TMP_DIR=/tmp
java -XX:ParallelGCThreads=30 -Djava.io.tmpdir=/tmp -jar /nas2/zhangj/biosoft/picard/picard.jar MarkDuplicates QUIET=true INPUT=SRR14205837.rmChrM.bam OUTPUT=SRR14205837.rmDup.bam METRICS_FILE=SRR14205837.sorted.metrics REMOVE_DUPLICATES=true CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT TMP_DIR=/tmp
(rna) [zhangj@biowisdom Mapping]$ java -XX:ParallelGCThreads=30 -Djava.io.tmpdir=/tmp -jar /nas2/zhangj/biosoft/picard/picard.jar MarkDuplicates QUIET=true INPUT=SRR14205836.rmChrM.bam OUTPUT=SRR14205836.rmDup.bam METRICS_FILE=SRR14205836.sorted.metrics REMOVE_DUPLICATES=true CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT TMP_DIR=/tmp
INFO 2025-02-26 19:03:33 MarkDuplicates
********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
**********
https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
********** MarkDuplicates -QUIET true -INPUT SRR14205836.rmChrM.bam -OUTPUT SRR14205836.rmDup.bam -METRICS_FILE SRR14205836.sorted.metrics -REMOVE_DUPLICATES true -CREATE_INDEX true -VALIDATION_STRINGENCY LENIENT -TMP_DIR /tmp
**********
19:03:33.717 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/nas2/zhangj/biosoft/picard/picard.jar!/com/intel/gkl/native/libgkl_compression.so
INFO 2025-02-26 19:03:33 MarkDuplicates Start of doWork freeMemory: 274413664; totalMemory: 285212672; maxMemory: 31675383808
INFO 2025-02-26 19:03:33 MarkDuplicates Reading input file and constructing read end information.
INFO 2025-02-26 19:03:33 MarkDuplicates Will retain up to 114765883 data points before spilling to disk.
WARNING 2025-02-26 19:03:34 AbstractOpticalDuplicateFinderCommandLineProgram A field field parsed out of a read name was expected to contain an integer and did not. Read name: SRR14205836.1843967. Cause: String 'SRR14205836.1843967' did not start with a parsable number.
Exception in thread "main" java.lang.NullPointerException: Cannot invoke "htsjdk.samtools.SAMReadGroupRecord.getReadGroupId()" because the return value of "htsjdk.samtools.SAMRecord.getReadGroup()" is null
at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:558)
at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:270)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:281)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:105)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:115)
The notice tells me that Picard's command-line syntax has changed; see https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition) for details. In the new syntax, parameters are no longer passed as KEY=value but with -- (or the short - form), e.g. INPUT=sample.bam becomes --INPUT sample.bam (or -I sample.bam).
The actual error, though, is the NullPointerException on getReadGroup(): the BAM files carry no @RG read-group records. So first check the header for @RG, then add a read group per sample with samtools addreplacerg before rerunning MarkDuplicates:
samtools view -H SRR14205836.rmChrM.bam | grep '@RG'
# @RG\tID:AB318-005\tLB:AB318-005\tPL:ILLUMINA\tSM:AB318-005
# @RG\tID:ReadGroup1\tSM:SampleName\tPL:Illumina\tLB:Library.fa
# You can set ReadGroup1 and SampleName and Library.fa to something really existing in your dataset (lane number, real sample name and the name of the fasta file)
# Run the following two commands
samtools addreplacerg -r "@RG\tID:SRR14205836\tSM:SRR14205836\tPL:Illumina\tLB:SRR14205836" -o SRR14205836.rmChrM-1.bam SRR14205836.rmChrM.bam
samtools addreplacerg -r "@RG\tID:SRR14205837\tSM:SRR14205837\tPL:Illumina\tLB:SRR14205837" -o SRR14205837.rmChrM-1.bam SRR14205837.rmChrM.bam
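To confirm that the read groups were added (a quick check of my own, not in the original post), look at the new BAM header; there should now be an @RG line per sample (the commented line shows the expected pattern):
samtools view -H SRR14205836.rmChrM-1.bam | grep '@RG'
# @RG\tID:SRR14205836\tSM:SRR14205836\tPL:Illumina\tLB:SRR14205836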
The updated code:
# generate mkDup.sh
# REMOVE_DUPLICATES=false: mark duplicate reads, not remove.
# Change it to true to remove duplicate reads.
cat ../data/cleandata/ID | while read id
do
echo "java -XX:ParallelGCThreads=30 -Xmx50g -Djava.io.tmpdir=/tmp -jar /nas2/zhangj/biosoft/picard/picard.jar MarkDuplicates --QUIET true -I ${id}.rmChrM-1.bam -O ${id}.mkDup.bam --METRICS_FILE ${id}.sorted.metrics --CREATE_INDEX true --VALIDATION_STRINGENCY LENIENT --TMP_DIR /tmp --ASSUME_SORT_ORDER coordinate"
done >mkDup.sh
# Run mkDup.sh; submit the jobs in parallel with ParaFly
nohup ParaFly -c mkDup.sh -CPU 2 1>mkDup.log 2>&1 &
Contents of mkDup.sh:
java -XX:ParallelGCThreads=30 -Xmx50g -Djava.io.tmpdir=/tmp -jar /nas2/zhangj/biosoft/picard/picard.jar MarkDuplicates --QUIET true -I SRR14205836.rmChrM-1.bam -O SRR14205836.mkDup.bam --METRICS_FILE SRR14205836.sorted.metrics --CREATE_INDEX true --VALIDATION_STRINGENCY LENIENT --TMP_DIR /tmp --ASSUME_SORT_ORDER coordinate
java -XX:ParallelGCThreads=30 -Xmx50g -Djava.io.tmpdir=/tmp -jar /nas2/zhangj/biosoft/picard/picard.jar MarkDuplicates --QUIET true -I SRR14205837.rmChrM-1.bam -O SRR14205837.mkDup.bam --METRICS_FILE SRR14205837.sorted.metrics --CREATE_INDEX true --VALIDATION_STRINGENCY LENIENT --TMP_DIR /tmp --ASSUME_SORT_ORDER coordinate
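The metrics file written by MarkDuplicates records the duplication rate. A small sketch of my own (not from the original post) to pull PERCENT_DUPLICATION out of it; the column is located by name so the awk does not depend on column order:
# the METRICS section is one header line plus one value line per library
grep -A 2 '^## METRICS' SRR14205836.sorted.metrics | tail -n 2 \
| awk -F'\t' 'NR==1{for(i=1;i<=NF;i++) if($i=="PERCENT_DUPLICATION") col=i} NR==2{print "PERCENT_DUPLICATION =", $col}'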
Ref: Harvard FAS Informatics - ATAC-seq Guidelines
Some researchers choose to remove reads that did not map uniquely, which can be done with the -q option of samtools view.
Different aligners implement mapping quality (MAPQ) differently; see "More madness with MAPQ scores (a.k.a. why bioinformaticians hate poor and incomplete software documentation)" (https://www.acgt.me/?offset=1426809676847).
So when filtering out non-unique alignments by MAPQ, make sure you check how the aligner you used assigns MAPQ values.
For Bowtie2, MAPQ > 30 can be used. Example code below (not run here):
# Remove multi-mapped reads (i.e. those with MAPQ < 30, using -q in SAMtools)
samtools view -h -q 30 ${sample}.bam > ${sample}.rmMulti.bam
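Before settling on a cutoff, it can also help to look at the MAPQ distribution your aligner actually produced (a quick sketch of my own, not in the original post):
# tabulate MAPQ values (5th column of the SAM records)
samtools view ${sample}.bam | cut -f 5 | sort -n | uniq -c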
In the ENCODE pipeline and in some papers, the following kinds of reads are also removed (via the -F flag of samtools view, with a value of 1796 or 1804).
The remaining reads are the so-called "properly mapped reads". Example code below (not run here):
# Remove reads unmapped, mate unmapped, not primary alignment, reads failing platform, duplicates (-F 1804)
# Retain properly paired reads -f 2
samtools view -h -b -F 1804 -f 2 ${sample}.bam > ${sample}.filtered.bam
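To see which SAM flag bits a value such as 1804 combines, samtools can decode it for you (the output is shown as comments; exact formatting may differ between versions):
samtools flags 1804
# 0x70c  1804  UNMAP,MUNMAP,SECONDARY,QCFAIL,DUP
samtools flags 1796
# 0x704  1796  UNMAP,SECONDARY,QCFAIL,DUP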
When several libraries are built for one experimental condition (i.e. one condition has multiple biological and/or technical replicates), you may want to merge the BAM files before calling peaks (for example, merging the BAMs of technical replicates, or merging BAMs to gain sequencing depth).
The trade-off between "merging BAM files" and "merging peaks" has been discussed elsewhere.
Example code for merging BAMs:
samtools merge -@ $PPN condition1.merged.bam sample1.bam sample2.bam sample3.bam
samtools index -@ $PPN condition1.merged.bam
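After merging, a quick samtools flagstat on the merged file (my own addition) is an easy sanity check that the read counts of the input BAMs add up:
samtools flagstat -@ $PPN condition1.merged.bam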
In the original ATAC-seq paper (https://www.nature.com/articles/nmeth.2688), because Tn5 transposase binds as a dimer and inserts two adapters separated by 9 bp, all reads aligned to the + strand were offset by +4 bp and all reads aligned to the − strand by −5 bp.
Ref: shifting reads bam for NucleoATAC?
However, for peak calling the read shift is probably not very important: it is a tiny adjustment, while peaks usually span hundreds of base pairs. The shift only becomes crucial when insertion positions are needed at single-base resolution, for example in transcription factor (TF motif) footprinting.
Also keep in mind that not every footprinting tool expects shifted reads; some of them, such as NucleoATAC, already handle the shift internally.
Example code for the shift:
Ref: Shifting reads for ATAC-seq alignments(https://www.biostars.org/p/187204/#187206)
###################################### Method 1: https://www.biostars.org/p/187204/#187206
# use bedtools and awk
# the BAM file should be sorted by read name beforehand
samtools sort -n -T aln.sorted -o aln.sorted.bam aln.bam
# The bedtools command should extract the paired-end alignments as bedpe format, then the awk command should shift the fragments as needed
bedtools bamtobed -i aln.sorted.bam -bedpe | awk -v OFS="\t" '{if ($9=="+") {print $1,$2+4,$6+4} \
else if ($9=="-") {print $1,$2-5,$6-5}}' > fragments.bed
###################################### Method 2: https://deeptools.readthedocs.io/en/develop/content/tools/alignmentSieve.html
# use --ATACshift
alignmentSieve --numberOfProcessors 8 --ATACshift --bam sample1.bam -o sample1.tmp.bam
# the bam file needs to be sorted again
samtools sort -@ 8 -O bam -o sample1.shifted.bam sample1.tmp.bam
samtools index -@ 8 sample1.shifted.bam
rm sample1.tmp.bam
###################################### Method 3: https://bioconductor.org/packages/release/bioc/html/ATACseqQC.html
# use ATACseqQC
## load the library
library(ATACseqQC)
## input is bamFile
bamfile <- system.file("extdata", "GL1.bam", package="ATACseqQC", mustWork=TRUE)
bamfile.labels <- gsub(".bam", "", basename(bamfile))
## bamfile tags
tags <- c("AS", "XN", "XM", "XO", "XG", "NM", "MD", "YS", "YT")
## files will be output into outPath
outPath <- "splited"
dir.create(outPath)
## shift the bam file by the 5'ends
library(BSgenome.Hsapiens.UCSC.hg19)
seqlev <- "chr1" ## subsample data for quick run
which <- as(seqinfo(Hsapiens)[seqlev], "GRanges")
gal <- readBamFile(bamfile, tag=tags, which=which, asMates=TRUE)
gal1 <- shiftGAlignmentsList(gal)
shiftedBamfile <- file.path(outPath, "shifted.bam")
export(gal1, shiftedBamfile)
Since paired-end sequencing is widely used for ATAC-seq, we tell MACS2 that the data are paired via the -f option. This way MACS2 only analyses properly paired reads (we already filtered the BAM as described above); fragments are defined by the read pairs, with no model building or artificial extension.
cd 2022-GSE171832-28-embryonal-atacseq/PeakCalling/
# activate the conda environment
conda activate rna
# install macs2
conda install macs2 -y
# -f BAMPE, use paired-end information
# --keep-dup all, keep all duplicate reads.
# contents of macs2.sh:
macs2 callpeak -f BAMPE -g hs --keep-dup all --cutoff-analysis -n SRR14205836 -t ../Mapping/SRR14205836.mkDup.bam --outdir SRR14205836/
macs2 callpeak -f BAMPE -g hs --keep-dup all --cutoff-analysis -n SRR14205837 -t ../Mapping/SRR14205837.mkDup.bam --outdir SRR14205837/
# Run macs2.sh; submit the jobs in parallel with ParaFly
nohup ParaFly -c macs2.sh -CPU 2 1>macs2.log 2>&1 &
The results after running:
├── SRR14205836
│ ├── SRR14205836_cutoff_analysis.txt
│ ├── SRR14205836_peaks.narrowPeak
│ ├── SRR14205836_peaks.xls
│ └── SRR14205836_summits.bed
└── SRR14205837
├── SRR14205837_cutoff_analysis.txt
├── SRR14205837_peaks.narrowPeak
├── SRR14205837_peaks.xls
└── SRR14205837_summits.bed
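A quick sanity check (my own addition) is to count how many peaks were called per sample; the exact numbers will of course depend on your data:
wc -l SRR14205836/SRR14205836_peaks.narrowPeak SRR14205837/SRR14205837_peaks.narrowPeak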
Use bamCoverage from deepTools to create bigWig files for visualization. It offers several different ways to normalize the signal.
# contents of bigWig.sh
# bam to bigwig, normalize using 1x effective genome size
# effective genome size: https://deeptools.readthedocs.io/en/latest/content/feature/effectiveGenomeSize.html
bamCoverage --numberOfProcessors 30 --binSize 10 --normalizeUsing RPGC --effectiveGenomeSize 2652783500 --bam ../Mapping/SRR14205836.mkDup.bam -o SRR14205836.bw
bamCoverage --numberOfProcessors 30 --binSize 10 --normalizeUsing RPGC --effectiveGenomeSize 2652783500 --bam ../Mapping/SRR14205837.mkDup.bam -o SRR14205837.bw
# Run bigWig.sh; submit the jobs in parallel with ParaFly
nohup ParaFly -c bigWig.sh -CPU 2 1>bigWig.log 2>&1 &
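The bigWig files can also be fed to deepTools computeMatrix/plotHeatmap, for example to look at the signal around TSSs. A sketch of my own, assuming a BED file of gene/TSS coordinates (genes.bed, hypothetical) that uses the same chromosome naming as the BAMs:
# signal in a ±2 kb window around each TSS, then a heatmap + profile
computeMatrix reference-point --referencePoint TSS -b 2000 -a 2000 \
-R genes.bed -S SRR14205836.bw SRR14205837.bw \
--skipZeros -p 30 -o matrix_TSS.gz
plotHeatmap -m matrix_TSS.gz -out TSS_heatmap.png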
While processing ATAC-seq data, we can compute a number of metrics to check the quality of the data/library.
ENCODE defines a set of current standards for ATAC-seq and explains each term in detail (see the ENCODE ATAC-seq page referenced below). We will write separate hands-on notes discussing each metric later.
The original discussion can be found at: https://yiweiniu.github.io/blog/2019/03/ATAC-seq-data-analysis-from-FASTQ-to-peaks/
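As an example of one such metric, here is a minimal sketch of my own (not from the original post) estimating FRiP (fraction of reads in peaks) for one sample with samtools and bedtools:
# total mapped reads
total=$(samtools view -c -F 4 ../Mapping/SRR14205836.mkDup.bam)
# reads overlapping the called peaks (-u: report each read once; -ubam: uncompressed BAM out)
in_peaks=$(bedtools intersect -u -ubam -a ../Mapping/SRR14205836.mkDup.bam \
-b SRR14205836/SRR14205836_peaks.narrowPeak | samtools view -c -)
echo "FRiP = $(awk -v a=$in_peaks -v b=$total 'BEGIN{printf "%.4f\n", a/b}')"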
Peaks can be filtered against the DAC Blacklisted Regions: https://www.encodeproject.org/annotations/ENCSR636HFF/. The filtering code below comes from ENCODE - ATAC-seq Data Standards and Prototype Processing Pipeline (https://www.encodeproject.org/atac-seq/).
From the DAC Blacklisted Regions page above, download files.txt.
# download; this yields the file ENCFF547MET.bed.gz
xargs -n 1 curl -O -L < files.txt
# the chromosome names in this file carry a 'chr' prefix; our data use bare numbers, so strip it
zless -S ENCFF547MET.bed.gz |sed 's/^chr//g' >ENCFF547MET.bed
Filtering:
Note: grep -P '^[0-9XY]+(?!_)' keeps peaks whose chromosome name starts with digits or X/Y and is not followed by '_' (the ENCODE pipeline uses the pattern chr[0-9XY]+(?!_); here the chr prefix has already been removed), i.e. it drops peaks on alternative/unplaced scaffolds.
# sample SRR14205836
bedtools intersect -v -a SRR14205836/SRR14205836_peaks.narrowPeak -b ENCFF547MET.bed \
| awk 'BEGIN{OFS="\t"} {if ($5>1000) $5=1000; print $0}' \
| grep -P '^[0-9XY]+(?!_)' | gzip -nc > SRR14205836/SRR14205836_peaks.narrowPeak.filt.gz
# sample SRR14205837
bedtools intersect -v -a SRR14205837/SRR14205837_peaks.narrowPeak -b ENCFF547MET.bed \
| awk 'BEGIN{OFS="\t"} {if ($5>1000) $5=1000; print $0}' \
| grep -P '^[0-9XY]+(?!_)' | gzip -nc > SRR14205837/SRR14205837_peaks.narrowPeak.filt.gz
The 5th column of a narrowPeak file is an integer score, int(-10*log10(qvalue)), with an allowed range of [0, 1000]; the awk step caps values above 1000 at 1000.
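To see the score capping in action (my own quick check), peek at the first few lines and columns of a filtered file:
zcat SRR14205836/SRR14205836_peaks.narrowPeak.filt.gz | head -n 3 | cut -f 1-5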
You may want to merge peaks coming from different libraries or samples. Corces et al., 2016 (https://www.nature.com/articles/ng.3646) merged peaks as follows:
To generate a non-redundant list of hematopoiesis- and cancer-related peaks, we first extended summits to 500-bp windows (±250 bp). We then ranked the 500-bp peaks by summit significance value (defined by MACS2) and chose a list of non-overlapping, maximally significant peaks.
This can be done with the BEDOPS tools (ref: Collapsing multiple BED files into a master list by signal).
Example code below:
#!/bin/bash
# modified from https://bedops.readthedocs.io/en/latest/content/usage-examples/master-list.html
summit_bed=(sample1_summits.bed sample2_summits.bed sample3_summits.bed)
out=fAdrenal.master.merge.bed
tmpd=/tmp/tmp$$
mkdir -p $tmpd
## First, union all the peaks together into a single file.
bedlist=""
for bed in ${summit_bed[*]}
do
bedlist="$bedlist $bed"
done
# extended summits to 500-bp windows (±250 bp)
bedops --range 250 -u $bedlist > $tmpd/tmp.bed
## The master list is constructed iteratively. For each pass through
## the loop, elements not yet in the master list are merged into
## non-overlapping intervals that span the union (this is just bedops
## -m). Then for each merged interval, an original element of highest
## score within the interval is selected to go in the master list.
## Anything that overlaps the selected element is thrown out, and the
## process then repeats.
iters=1
solns=""
stop=0
while [ $stop == 0 ]
do
echo "merge steps..."
## Condense the union into merged intervals. This klugey bit
## before and after the merging is because we don't want to merge
## regions that are simply adjacent but not overlapping
bedops -m --range 0:-1 $tmpd/tmp.bed \
| bedops -u --range 0:1 - \
> $tmpd/tmpm.bed
## Grab the element with the highest score among all elements forming each interval.
## If multiple elements tie for the highest score, just grab one of them.
## Result is the current master list. Probably don't need to sort, but do it anyway
## to be safe since we're not using --echo with bedmap call.
bedmap --max-element $tmpd/tmpm.bed $tmpd/tmp.bed \
| sort-bed - \
> $tmpd/$iters.bed
solns="$solns $tmpd/$iters.bed"
echo "Adding `awk 'END { print NR }' $tmpd/$iters.bed` elements"
## Are there any elements that don't overlap the current master
## list? If so, add those in, and repeat. If not, we're done.
bedops -n 1 $tmpd/tmp.bed $tmpd/$iters.bed \
> $tmpd/tmp2.bed
mv $tmpd/tmp2.bed $tmpd/tmp.bed
if [ ! -s $tmpd/tmp.bed ]
then
stop=1
fi
((iters++))
done
## final solution
bedops -u $solns \
> $out
## Clean up
rm -r $tmpd
exit 0
That completes the main upstream ATAC-seq analysis. For downstream analysis, you may want to annotate the peaks, look for enriched transcription factor motifs, compare peaks between conditions, and integrate the ATAC-seq data with other data types such as RNA-seq.
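As one example of downstream work, peaks can be annotated in R with ChIPseeker. A minimal sketch of my own, assuming the peaks are in hg38 coordinates (swap in the TxDb for your genome build); note that this TxDb uses 'chr'-prefixed chromosome names, so with the bare-number naming used above you would need to add the 'chr' prefix back to the peak file first:
library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
## read the blacklist-filtered narrowPeak file produced above
peak <- readPeakFile("SRR14205836/SRR14205836_peaks.narrowPeak.filt.gz")
## annotate peaks relative to TSS regions and summarize
peakAnno <- annotatePeak(peak, tssRegion = c(-3000, 3000), TxDb = txdb)
plotAnnoPie(peakAnno)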
(The End)
Follow-up posts will dig into the details of this walkthrough with more hands-on discussion~