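The GATK website describes the Resource bundle here: (https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle)
For well-known reasons, resources hosted on Google Buckets are hard to download, so that method is not covered here. According to the official documentation, FTP access was shut down in 2020 (in my own testing, the GATK resource bundle FTP server is indeed completely dead).
That leaves only one of the three download methods, so below I share how to download the Resource bundle from Azure.
The Azure resource information is listed on the GATK Resource Bundle page (https://learn.microsoft.com/en-us/azure/open-datasets/dataset-gatk-resource-bundle). The key items are the Data Access addresses of each resource bundle: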
datasetgatkbestpractices
West US 2: 'https://datasetgatkbestpractices.blob.core.windows.net/dataset'
West Central US: 'https://datasetgatkbestpractices-secondary.blob.core.windows.net/dataset'
SAS Token: ?sv=2020-04-08&si=prod&sr=c&sig=6SaDfKtXAIfdpO%2BkvNA%2FsTNmNij%2Byh%2F%2F%2Bf98WAUqs7I%3D
datasetgatklegacybundles
West US 2: 'https://datasetgatklegacybundles.blob.core.windows.net/dataset'
West Central US: 'https://datasetgatklegacybundles-secondary.blob.core.windows.net/dataset'
SAS Token: ?sv=2020-04-08&si=prod&sr=c&sig=xBfxOPBqHKUCszzwbNCBYF0k9osTQjKnZbEjXCW7gU0%3D
datasetgatktestdata
West US 2: 'https://datasetgatktestdata.blob.core.windows.net/dataset'
West Central US: 'https://datasetgatktestdata-secondary.blob.core.windows.net/dataset'
SAS Token: ?sv=2020-04-08&si=prod&sr=c&sig=fzLts1Q2vKjuvR7g50vE4HteEHBxTcJbNvf%2FZCeDMO4%3D
datasetpublicbroadref
West US 2: 'https://datasetpublicbroadref.blob.core.windows.net/dataset'
West Central US: 'https://datasetpublicbroadref-secondary.blob.core.windows.net/dataset'
SAS Token: ?sv=2020-04-08&si=prod&sr=c&sig=DQxmjB4D1lAfOW9AxIWbXwZx6ksbwjlNkixw597JnvQ%3D
datasetbroadpublic
West US 2: 'https://datasetbroadpublic.blob.core.windows.net/dataset'
West Central US: 'https://datasetbroadpublic-secondary.blob.core.windows.net/dataset'
SAS Token: ?sv=2020-04-08&si=prod&sr=c&sig=u%2Bg2Ab7WKZEGiAkwlj6nKiEeZ5wdoJb10Az7uUwis%2Fg%3D
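As shown, five datasets are available for download; see the official GATK resource bundle page for what each one contains.
Each dataset has two server addresses, one in West US 2 and one in West Central US, together with a SAS token. Appending the token to either address gives a valid Azure resource URL.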
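The concatenation described above can be sketched as a small shell helper. The function name make_sas_url is hypothetical and only used for illustration; it simply joins a base address and a SAS token, accepting the token with or without its leading "?".

```shell
# Hypothetical helper: join a container address and a SAS token into one URL.
# $1 = base address, e.g. https://<account>.blob.core.windows.net/dataset
# $2 = SAS token, with or without the leading '?'
make_sas_url() {
    base=$1
    token=$2
    # Drop the leading '?' if present so exactly one '?' ends up in the URL.
    case $token in
        \?*) token=${token#?} ;;
    esac
    printf '%s?%s\n' "$base" "$token"
}
```

For example, calling make_sas_url with the datasetgatklegacybundles West US 2 address and its SAS token prints the full URL that azcopy expects as a source. Keep the result in single or double quotes when passing it around, since it contains "&" and "?".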
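Downloading from Azure requires the azcopy tool; for installation and usage details see https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10.
Here the Linux version of azcopy is used to download the datasetgatklegacybundles dataset as an example.
Download and install azcopy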
wget https://aka.ms/downloadazcopy-v10-linux
mv downloadazcopy-v10-linux az-copy.tar
tar -xvf az-copy.tar
# add azcopy to PATH (replace your_azcopy_path with the extracted directory that contains the azcopy binary)
# echo 'export PATH=$PATH:your_azcopy_path' >> ~/.bashrc
# source ~/.bashrc
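The archive usually unpacks into a versioned directory, so the path to add to PATH is not known in advance. A minimal sketch for locating it is shown here; find_azcopy_dir is a hypothetical helper name, and it assumes a find that supports -maxdepth (GNU and BSD find both do).

```shell
# Hypothetical helper: print the directory that holds the extracted azcopy binary.
# $1 = directory in which the archive was unpacked (defaults to the current one)
find_azcopy_dir() {
    search_root=${1:-.}
    # Look at most two levels deep for a regular file named "azcopy".
    azcopy_bin=$(find "$search_root" -maxdepth 2 -type f -name azcopy | head -n 1)
    # Fail if nothing was found so callers can detect the problem.
    [ -n "$azcopy_bin" ] || return 1
    dirname "$azcopy_bin"
}
```

After extraction, something like export PATH="$PATH:$(cd "$(find_azcopy_dir .)" && pwd)" makes the tool available in the current shell, and azcopy --version is a quick way to confirm that it runs.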
Download the resource bundle
First, assemble the resource URLs for datasetgatklegacybundles:
# West US 2
gatklegacybundles_http_1='https://datasetgatklegacybundles.blob.core.windows.net/dataset?sv=2020-04-08&si=prod&sr=c&sig=xBfxOPBqHKUCszzwbNCBYF0k9osTQjKnZbEjXCW7gU0%3D'
# West Central US
gatklegacybundles_http_2='https://datasetgatklegacybundles-secondary.blob.core.windows.net/dataset?sv=2020-04-08&si=prod&sr=c&sig=xBfxOPBqHKUCszzwbNCBYF0k9osTQjKnZbEjXCW7gU0%3D'
Then download with azcopy:
# azcopy copy [source] [destination] [flags]
azcopy copy "$gatklegacybundles_http_1" gatklegacybundles --recursive
Because the source is a directory, remember to add the --recursive flag; after that, just wait for the download to finish.
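The West Central US address defined earlier is never used in the command above. One possible way to use it is as a fallback when the first endpoint fails, sketched below. The function name download_with_fallback and the AZCOPY_BIN variable are assumptions made for this example, not part of azcopy; the only azcopy call used is the same azcopy copy ... --recursive shown above.

```shell
# Hypothetical helper: try each source URL in order until one download succeeds.
# $1 = local destination directory; remaining arguments = source URLs to try.
# AZCOPY_BIN (an assumption of this sketch) lets the command name be overridden;
# it defaults to "azcopy".
download_with_fallback() {
    dest=$1
    shift
    for src in "$@"; do
        if "${AZCOPY_BIN:-azcopy}" copy "$src" "$dest" --recursive; then
            return 0
        fi
        echo "download from $src failed" >&2
    done
    # Every endpoint failed.
    return 1
}
```

With the two variables defined earlier, the call would be download_with_fallback gatklegacybundles "$gatklegacybundles_http_1" "$gatklegacybundles_http_2". The URLs are quoted so that the "?" and "&" characters in the SAS token reach azcopy unchanged.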