Previously we introduced two excellent reviews of ATAC-seq data analysis pipelines: 综述:ATAC-Seq 数据分析工具大全 and Omni-ATAC:更新和优化的ATAC-seq协议(Nat Protoc). Today we move on to the hands-on part!
The resource followed in this walkthrough is listed below. Its content was compiled from many sources on the internet (forums, papers, workshops, etc.); the author could not list every source, but wants to thank them all for sharing their experience, knowledge and code:
URL: https://yiweiniu.github.io/blog/2019/03/ATAC-seq-data-analysis-from-FASTQ-to-peaks/
This post continues the previous one: 一个优秀的ATAC-seq数据分析资源实战(一).
Last time we stopped at removing PCR-duplicate reads with Picard's MarkDuplicates, which failed with the error below:
java -XX:ParallelGCThreads=30 -Djava.io.tmpdir=/tmp -jar /nas2/zhangj/biosoft/picard/picard.jar MarkDuplicates QUIET=true INPUT=SRR14205836.rmChrM.bam OUTPUT=SRR14205836.rmDup.bam METRICS_FILE=SRR14205836.sorted.metrics REMOVE_DUPLICATES=true CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT TMP_DIR=/tmp
java -XX:ParallelGCThreads=30 -Djava.io.tmpdir=/tmp -jar /nas2/zhangj/biosoft/picard/picard.jar MarkDuplicates QUIET=true INPUT=SRR14205837.rmChrM.bam OUTPUT=SRR14205837.rmDup.bam METRICS_FILE=SRR14205837.sorted.metrics REMOVE_DUPLICATES=true CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT TMP_DIR=/tmp
(rna) [zhangj@biowisdom Mapping]$ java -XX:ParallelGCThreads=30 -Djava.io.tmpdir=/tmp -jar /nas2/zhangj/biosoft/picard/picard.jar MarkDuplicates QUIET=true INPUT=SRR14205836.rmChrM.bam OUTPUT=SRR14205836.rmDup.bam METRICS_FILE=SRR14205836.sorted.metrics REMOVE_DUPLICATES=true CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT TMP_DIR=/tmp
INFO 2025-02-26 19:03:33 MarkDuplicates
********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
**********
https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
********** MarkDuplicates -QUIET true -INPUT SRR14205836.rmChrM.bam -OUTPUT SRR14205836.rmDup.bam -METRICS_FILE SRR14205836.sorted.metrics -REMOVE_DUPLICATES true -CREATE_INDEX true -VALIDATION_STRINGENCY LENIENT -TMP_DIR /tmp
**********
19:03:33.717 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/nas2/zhangj/biosoft/picard/picard.jar!/com/intel/gkl/native/libgkl_compression.so
INFO 2025-02-26 19:03:33 MarkDuplicates Start of doWork freeMemory: 274413664; totalMemory: 285212672; maxMemory: 31675383808
INFO 2025-02-26 19:03:33 MarkDuplicates Reading input file and constructing read end information.
INFO 2025-02-26 19:03:33 MarkDuplicates Will retain up to 114765883 data points before spilling to disk.
WARNING 2025-02-26 19:03:34 AbstractOpticalDuplicateFinderCommandLineProgram A field field parsed out of a read name was expected to contain an integer and did not. Read name: SRR14205836.1843967. Cause: String 'SRR14205836.1843967' did not start with a parsable number.
Exception in thread "main" java.lang.NullPointerException: Cannot invoke "htsjdk.samtools.SAMReadGroupRecord.getReadGroupId()" because the return value of "htsjdk.samtools.SAMRecord.getReadGroup()" is null
at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:558)
at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:270)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:281)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:105)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:115)
The notice tells me that Picard's command-line syntax has changed; see https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition) for details. In the new syntax, parameters are no longer passed as KEY=value but with -- (or the short - form), e.g. INPUT=sample.bam becomes --INPUT sample.bam (or -I sample.bam).
The actual error, though, is the NullPointerException on getReadGroup(): the BAM files carry no @RG read-group records. So first check the header for @RG, then add a read group per sample with samtools addreplacerg before rerunning MarkDuplicates:
samtools view -H SRR14205836.rmChrM.bam | grep '@RG'
# @RG\tID:AB318-005\tLB:AB318-005\tPL:ILLUMINA\tSM:AB318-005
# @RG\tID:ReadGroup1\tSM:SampleName\tPL:Illumina\tLB:Library.fa
# You can set ReadGroup1 and SampleName and Library.fa to something really existing in your dataset (lane number, real sample name and the name of the fasta file)
# Run the following two commands
samtools addreplacerg -r "@RG\tID:SRR14205836\tSM:SRR14205836\tPL:Illumina\tLB:SRR14205836" -o SRR14205836.rmChrM-1.bam SRR14205836.rmChrM.bam
samtools addreplacerg -r "@RG\tID:SRR14205837\tSM:SRR14205837\tPL:Illumina\tLB:SRR14205837" -o SRR14205837.rmChrM-1.bam SRR14205837.rmChrM.bam
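To confirm that the read groups were added (a quick check of my own, not in the original post), look at the new BAM header; there should now be an @RG line per sample (the commented line shows the expected pattern):
samtools view -H SRR14205836.rmChrM-1.bam | grep '@RG'
# @RG\tID:SRR14205836\tSM:SRR14205836\tPL:Illumina\tLB:SRR14205836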
The updated code:
# generate mkDup.sh
# REMOVE_DUPLICATES=false: mark duplicate reads, not remove.
# Change it to true to remove duplicate reads.
cat ../data/cleandata/ID | while read id
do
echo "java -XX:ParallelGCThreads=30 -Xmx50g -Djava.io.tmpdir=/tmp -jar /nas2/zhangj/biosoft/picard/picard.jar MarkDuplicates --QUIET true -I ${id}.rmChrM-1.bam -O ${id}.mkDup.bam --METRICS_FILE ${id}.sorted.metrics --CREATE_INDEX true --VALIDATION_STRINGENCY LENIENT --TMP_DIR /tmp --ASSUME_SORT_ORDER coordinate"
done >mkDup.sh
# Run mkDup.sh; submit the jobs in parallel with ParaFly
nohup ParaFly -c mkDup.sh -CPU 2 1>mkDup.log 2>&1 &
Contents of mkDup.sh:
java -XX:ParallelGCThreads=30 -Xmx50g -Djava.io.tmpdir=/tmp -jar /nas2/zhangj/biosoft/picard/picard.jar MarkDuplicates --QUIET true -I SRR14205836.rmChrM-1.bam -O SRR14205836.mkDup.bam --METRICS_FILE SRR14205836.sorted.metrics --CREATE_INDEX true --VALIDATION_STRINGENCY LENIENT --TMP_DIR /tmp --ASSUME_SORT_ORDER coordinate
java -XX:ParallelGCThreads=30 -Xmx50g -Djava.io.tmpdir=/tmp -jar /nas2/zhangj/biosoft/picard/picard.jar MarkDuplicates --QUIET true -I SRR14205837.rmChrM-1.bam -O SRR14205837.mkDup.bam --METRICS_FILE SRR14205837.sorted.metrics --CREATE_INDEX true --VALIDATION_STRINGENCY LENIENT --TMP_DIR /tmp --ASSUME_SORT_ORDER coordinate
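The metrics file written by MarkDuplicates records the duplication rate. A small sketch of my own (not from the original post) to pull PERCENT_DUPLICATION out of it; the column is located by name so the awk does not depend on column order:
# the METRICS section is one header line plus one value line per library
grep -A 2 '^## METRICS' SRR14205836.sorted.metrics | tail -n 2 \
| awk -F'\t' 'NR==1{for(i=1;i<=NF;i++) if($i=="PERCENT_DUPLICATION") col=i} NR==2{print "PERCENT_DUPLICATION =", $col}'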
Ref: Harvard FAS Informatics - ATAC-seq Guidelines
Some researchers choose to remove reads that did not map uniquely, which can be done with the -q option of samtools view.
Different aligners implement mapping quality (MAPQ) differently; see "More madness with MAPQ scores (a.k.a. why bioinformaticians hate poor and incomplete software documentation)" (https://www.acgt.me/?offset=1426809676847).
So when filtering out non-unique alignments by MAPQ, make sure you check how the aligner you used assigns MAPQ values.
For Bowtie2, MAPQ > 30 can be used. Example code below (not run here):
# Remove multi-mapped reads (i.e. those with MAPQ < 30, using -q in SAMtools)
samtools view -h -q 30 ${sample}.bam > ${sample}.rmMulti.bam
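Before settling on a cutoff, it can also help to look at the MAPQ distribution your aligner actually produced (a quick sketch of my own, not in the original post):
# tabulate MAPQ values (5th column of the SAM records)
samtools view ${sample}.bam | cut -f 5 | sort -n | uniq -c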
In the ENCODE pipeline and in some papers, the following kinds of reads are also removed (via the -F flag of samtools view, with a value of 1796 or 1804).
The remaining reads are the so-called "properly mapped reads". Example code below (not run here):
# Remove reads unmapped, mate unmapped, not primary alignment, reads failing platform, duplicates (-F 1804)
# Retain properly paired reads -f 2
samtools view -h -b -F 1804 -f 2 ${sample}.bam > ${sample}.filtered.bam
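To see which SAM flag bits a value such as 1804 combines, samtools can decode it for you (the output is shown as comments; exact formatting may differ between versions):
samtools flags 1804
# 0x70c  1804  UNMAP,MUNMAP,SECONDARY,QCFAIL,DUP
samtools flags 1796
# 0x704  1796  UNMAP,SECONDARY,QCFAIL,DUP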
When several libraries are built for one experimental condition (i.e. one condition has multiple biological and/or technical replicates), you may want to merge the BAM files before calling peaks (for example, merging the BAMs of technical replicates, or merging BAMs to gain sequencing depth).
The trade-off between "merging BAM files" and "merging peaks" has been discussed elsewhere.
Example code for merging BAMs:
samtools merge -@ $PPN condition1.merged.bam sample1.bam sample2.bam sample3.bam
samtools index -@ $PPN condition1.merged.bam
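After merging, a quick samtools flagstat on the merged file (my own addition) is an easy sanity check that the read counts of the input BAMs add up:
samtools flagstat -@ $PPN condition1.merged.bam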
In the original ATAC-seq paper (https://www.nature.com/articles/nmeth.2688), because Tn5 transposase binds as a dimer and inserts two adapters separated by 9 bp, all reads aligned to the + strand were offset by +4 bp and all reads aligned to the − strand by −5 bp.
Ref: shifting reads bam for NucleoATAC?
However, for peak calling the read shift is probably not very important: it is a tiny adjustment, while peaks usually span hundreds of base pairs. The shift only becomes crucial when insertion positions are needed at single-base resolution, for example in transcription factor (TF motif) footprinting.
Also keep in mind that not every footprinting tool expects shifted reads; some of them, such as NucleoATAC, already handle the shift internally.
Example code for the shift:
Ref: Shifting reads for ATAC-seq alignments(https://www.biostars.org/p/187204/#187206)
###################################### Method 1: https://www.biostars.org/p/187204/#187206
# use bedtools and awk
# the BAM file should be sorted by read name beforehand
samtools sort -n -T aln.sorted -o aln.sorted.bam aln.bam
# The bedtools command should extract the paired-end alignments as bedpe format, then the awk command should shift the fragments as needed
bedtools bamtobed -i aln.sorted.bam -bedpe | awk -v OFS="\t" '{if ($9=="+") {print $1,$2+4,$6+4} \
else if ($9=="-") {print $1,$2-5,$6-5}}' > fragments.bed
###################################### Method 2: https://deeptools.readthedocs.io/en/develop/content/tools/alignmentSieve.html
# use --ATACshift
alignmentSieve --numberOfProcessors 8 --ATACshift --bam sample1.bam -o sample1.tmp.bam
# the bam file needs to be sorted again
samtools sort -@ 8 -O bam -o sample1.shifted.bam sample1.tmp.bam
samtools index -@ 8 sample1.shifted.bam
rm sample1.tmp.bam
###################################### Method 3: https://bioconductor.org/packages/release/bioc/html/ATACseqQC.html
# use ATACseqQC
## load the library
library(ATACseqQC)
## input is bamFile
bamfile <- system.file("extdata", "GL1.bam", package="ATACseqQC", mustWork=TRUE)
bamfile.labels <- gsub(".bam", "", basename(bamfile))
## bamfile tags
tags <- c("AS", "XN", "XM", "XO", "XG", "NM", "MD", "YS", "YT")
## files will be output into outPath
outPath <- "splited"
dir.create(outPath)
## shift the bam file by the 5'ends
library(BSgenome.Hsapiens.UCSC.hg19)
seqlev <- "chr1" ## subsample data for quick run
which <- as(seqinfo(Hsapiens)[seqlev], "GRanges")
gal <- readBamFile(bamfile, tag=tags, which=which, asMates=TRUE)
gal1 <- shiftGAlignmentsList(gal)
shiftedBamfile <- file.path(outPath, "shifted.bam")
export(gal1, shiftedBamfile)
Since paired-end sequencing is widely used for ATAC-seq, we tell MACS2 that the data are paired via the -f option. This way MACS2 only analyses properly paired reads (we already filtered the BAM as described above); fragments are defined by the read pairs, with no model building or artificial extension.
cd 2022-GSE171832-28-embryonal-atacseq/PeakCalling/
# activate the conda environment
conda activate rna
# install macs2
conda install macs2 -y
# -f BAMPE, use paired-end information
# --keep-dup all, keep all duplicate reads.
# contents of macs2.sh:
macs2 callpeak -f BAMPE -g hs --keep-dup all --cutoff-analysis -n SRR14205836 -t ../Mapping/SRR14205836.mkDup.bam --outdir SRR14205836/
macs2 callpeak -f BAMPE -g hs --keep-dup all --cutoff-analysis -n SRR14205837 -t ../Mapping/SRR14205837.mkDup.bam --outdir SRR14205837/
# Run macs2.sh; submit the jobs in parallel with ParaFly
nohup ParaFly -c macs2.sh -CPU 2 1>macs2.log 2>&1 &
The results after running:
├── SRR14205836
│ ├── SRR14205836_cutoff_analysis.txt
│ ├── SRR14205836_peaks.narrowPeak
│ ├── SRR14205836_peaks.xls
│ └── SRR14205836_summits.bed
└── SRR14205837
├── SRR14205837_cutoff_analysis.txt
├── SRR14205837_peaks.narrowPeak
├── SRR14205837_peaks.xls
└── SRR14205837_summits.bed
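A quick sanity check (my own addition) is to count how many peaks were called per sample; the exact numbers will of course depend on your data:
wc -l SRR14205836/SRR14205836_peaks.narrowPeak SRR14205837/SRR14205837_peaks.narrowPeak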
Use bamCoverage from deepTools to create bigWig files for visualization. It offers several different ways to normalize the signal.
# contents of bigWig.sh
# bam to bigwig, normalize using 1x effective genome size
# effective genome size: https://deeptools.readthedocs.io/en/latest/content/feature/effectiveGenomeSize.html
bamCoverage --numberOfProcessors 30 --binSize 10 --normalizeUsing RPGC --effectiveGenomeSize 2652783500 --bam ../Mapping/SRR14205836.mkDup.bam -o SRR14205836.bw
bamCoverage --numberOfProcessors 30 --binSize 10 --normalizeUsing RPGC --effectiveGenomeSize 2652783500 --bam ../Mapping/SRR14205837.mkDup.bam -o SRR14205837.bw
# Run bigWig.sh; submit the jobs in parallel with ParaFly
nohup ParaFly -c bigWig.sh -CPU 2 1>bigWig.log 2>&1 &
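The bigWig files can also be fed to deepTools computeMatrix/plotHeatmap, for example to look at the signal around TSSs. A sketch of my own, assuming a BED file of gene/TSS coordinates (genes.bed, hypothetical) that uses the same chromosome naming as the BAMs:
# signal in a ±2 kb window around each TSS, then a heatmap + profile
computeMatrix reference-point --referencePoint TSS -b 2000 -a 2000 \
-R genes.bed -S SRR14205836.bw SRR14205837.bw \
--skipZeros -p 30 -o matrix_TSS.gz
plotHeatmap -m matrix_TSS.gz -out TSS_heatmap.png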
While processing ATAC-seq data, we can compute a number of metrics to check the quality of the data/library.
ENCODE defines a set of current standards for ATAC-seq and explains each term in detail (see the ENCODE ATAC-seq page referenced below). We will write separate hands-on notes discussing each metric later.
The original discussion can be found at: https://yiweiniu.github.io/blog/2019/03/ATAC-seq-data-analysis-from-FASTQ-to-peaks/
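As an example of one such metric, here is a minimal sketch of my own (not from the original post) estimating FRiP (fraction of reads in peaks) for one sample with samtools and bedtools:
# total mapped reads
total=$(samtools view -c -F 4 ../Mapping/SRR14205836.mkDup.bam)
# reads overlapping the called peaks (-u: report each read once; -ubam: uncompressed BAM out)
in_peaks=$(bedtools intersect -u -ubam -a ../Mapping/SRR14205836.mkDup.bam \
-b SRR14205836/SRR14205836_peaks.narrowPeak | samtools view -c -)
echo "FRiP = $(awk -v a=$in_peaks -v b=$total 'BEGIN{printf "%.4f\n", a/b}')"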
Peaks can be filtered against the DAC Blacklisted Regions: https://www.encodeproject.org/annotations/ENCSR636HFF/. The filtering code below comes from ENCODE - ATAC-seq Data Standards and Prototype Processing Pipeline (https://www.encodeproject.org/atac-seq/).
From the DAC Blacklisted Regions page above, download files.txt.
# download; this yields the file ENCFF547MET.bed.gz
xargs -n 1 curl -O -L < files.txt
# the chromosome names in this file carry a 'chr' prefix; our data use bare numbers, so strip it
zless -S ENCFF547MET.bed.gz |sed 's/^chr//g' >ENCFF547MET.bed
Filtering:
Note: grep -P '^[0-9XY]+(?!_)' keeps peaks whose chromosome name starts with digits or X/Y and is not followed by '_' (the ENCODE pipeline uses the pattern chr[0-9XY]+(?!_); here the chr prefix has already been removed), i.e. it drops peaks on alternative/unplaced scaffolds.
# sample SRR14205836
bedtools intersect -v -a SRR14205836/SRR14205836_peaks.narrowPeak -b ENCFF547MET.bed \
| awk 'BEGIN{OFS="\t"} {if ($5>1000) $5=1000; print $0}' \
| grep -P '^[0-9XY]+(?!_)' | gzip -nc > SRR14205836/SRR14205836_peaks.narrowPeak.filt.gz
# sample SRR14205837
bedtools intersect -v -a SRR14205837/SRR14205837_peaks.narrowPeak -b ENCFF547MET.bed \
| awk 'BEGIN{OFS="\t"} {if ($5>1000) $5=1000; print $0}' \
| grep -P '^[0-9XY]+(?!_)' | gzip -nc > SRR14205837/SRR14205837_peaks.narrowPeak.filt.gz
The 5th column of a narrowPeak file is an integer score, int(-10*log10(qvalue)), with an allowed range of [0, 1000]; the awk step caps values above 1000 at 1000.
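To see the score capping in action (my own quick check), peek at the first few lines and columns of a filtered file:
zcat SRR14205836/SRR14205836_peaks.narrowPeak.filt.gz | head -n 3 | cut -f 1-5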
You may want to merge peaks coming from different libraries or samples. Corces et al., 2016 (https://www.nature.com/articles/ng.3646) merged peaks as follows:
To generate a non-redundant list of hematopoiesis- and cancer-related peaks, we first extended summits to 500-bp windows (±250 bp). We then ranked the 500-bp peaks by summit significance value (defined by MACS2) and chose a list of non-overlapping, maximally significant peaks.
This can be done with the BEDOPS tools (ref: Collapsing multiple BED files into a master list by signal).
Example code below:
#!/bin/bash
# modified from https://bedops.readthedocs.io/en/latest/content/usage-examples/master-list.html
summit_bed=(sample1_summits.bed sample2_summits.bed sample3_summits.bed)
out=fAdrenal.master.merge.bed
tmpd=/tmp/tmp$$
mkdir -p $tmpd
## First, union all the peaks together into a single file.
bedlist=""
for bed in ${summit_bed[*]}
do
bedlist="$bedlist $bed"
done
# extended summits to 500-bp windows (±250 bp)
bedops --range 250 -u $bedlist > $tmpd/tmp.bed
## The master list is constructed iteratively. For each pass through
## the loop, elements not yet in the master list are merged into
## non-overlapping intervals that span the union (this is just bedops
## -m). Then for each merged interval, an original element of highest
## score within the interval is selected to go in the master list.
## Anything that overlaps the selected element is thrown out, and the
## process then repeats.
iters=1
solns=""
stop=0
while [ $stop == 0 ]
do
echo "merge steps..."
## Condense the union into merged intervals. This klugey bit
## before and after the merging is because we don't want to merge
## regions that are simply adjacent but not overlapping
bedops -m --range 0:-1 $tmpd/tmp.bed \
| bedops -u --range 0:1 - \
> $tmpd/tmpm.bed
## Grab the element with the highest score among all elements forming each interval.
## If multiple elements tie for the highest score, just grab one of them.
## Result is the current master list. Probably don't need to sort, but do it anyway
## to be safe since we're not using --echo with bedmap call.
bedmap --max-element $tmpd/tmpm.bed $tmpd/tmp.bed \
| sort-bed - \
> $tmpd/$iters.bed
solns="$solns $tmpd/$iters.bed"
echo "Adding `awk 'END { print NR }' $tmpd/$iters.bed` elements"
## Are there any elements that don't overlap the current master
## list? If so, add those in, and repeat. If not, we're done.
bedops -n 1 $tmpd/tmp.bed $tmpd/$iters.bed \
> $tmpd/tmp2.bed
mv $tmpd/tmp2.bed $tmpd/tmp.bed
if [ ! -s $tmpd/tmp.bed ]
then
stop=1
fi
((iters++))
done
## final solution
bedops -u $solns \
> $out
## Clean up
rm -r $tmpd
exit 0
That completes the main upstream ATAC-seq analysis. For downstream analysis, you may want to annotate the peaks, look for enriched transcription factor motifs, compare peaks between conditions, and integrate the ATAC-seq data with other data types such as RNA-seq.
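As one example of downstream work, peaks can be annotated in R with ChIPseeker. A minimal sketch of my own, assuming the peaks are in hg38 coordinates (swap in the TxDb for your genome build); note that this TxDb uses 'chr'-prefixed chromosome names, so with the bare-number naming used above you would need to add the 'chr' prefix back to the peak file first:
library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
## read the blacklist-filtered narrowPeak file produced above
peak <- readPeakFile("SRR14205836/SRR14205836_peaks.narrowPeak.filt.gz")
## annotate peaks relative to TSS regions and summarize
peakAnno <- annotatePeak(peak, tssRegion = c(-3000, 3000), TxDb = txdb)
plotAnnoPie(peakAnno)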
(The End)
Follow-up posts will dig into the details of this walkthrough with more hands-on discussion~