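A participant in our marathon course asked how to download files from the ArrayExpress database. Most of the posts we have shown so far download from GEO, so let's take a look at ArrayExpress here!
First, some quick background on the database. I asked Kimi: "Introduce the ArrayExpress database. How does it differ from GEO?"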
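ArrayExpress is a public repository of functional genomics data maintained by the European Bioinformatics Institute (EBI). It is used mainly to store and provide high-throughput functional genomics data. It supports multiple data types, including gene expression data (mainly from microarray and high-throughput sequencing platforms), chromatin immunoprecipitation sequencing (ChIP-seq), and genotyping.
Both ArrayExpress and GEO are important resources for functional genomics research, but they differ in data storage, submission, curation, access, and standardization. ArrayExpress is stricter about standardization and curation, while GEO is more convenient for submission and access. Researchers can choose whichever database suits their needs for data storage and analysis.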
We use the data from the following paper as an example: "Single-cell transcriptomics reveals cellular heterogeneity and molecular stratification of cervical cancer". The single-cell dataset accession is E-MTAB-11948:
Single-cell RNA sequencing gene expression data generated in this study has been deposited in the ArrayExpress database with accession of E-MTAB-11948. Any other data are available from the corresponding author on reasonable request. Software and resources used for analysis and plotting are described in each method section.
Click the Download all files button:
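The database offers download methods for three platforms. I chose to download on a server, with the selection shown here:
This gives the download command file E-MTAB-11948-unix-aspera.sh, with the following contents: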
ascp -P33001 -i "C:/aspera/cli/etc/asperaweb_id_dsa.openssh" --host=fasp.ebi.ac.uk --user=bsaspera --mode=recv "fire/E-MTAB-/948/E-MTAB-11948/Files/Sample4.csv" ./
ascp -P33001 -i "C:/aspera/cli/etc/asperaweb_id_dsa.openssh" --host=fasp.ebi.ac.uk --user=bsaspera --mode=recv "fire/E-MTAB-/948/E-MTAB-11948/Files/sample4features.tsv.gz" ./
ascp -P33001 -i "C:/aspera/cli/etc/asperaweb_id_dsa.openssh" --host=fasp.ebi.ac.uk --user=bsaspera --mode=recv "fire/E-MTAB-/948/E-MTAB-11948/Files/sample1features.tsv.gz" ./
ascp -P33001 -i "C:/aspera/cli/etc/asperaweb_id_dsa.openssh" --host=fasp.ebi.ac.uk --user=bsaspera --mode=recv "fire/E-MTAB-/948/E-MTAB-11948/Files/Sample5.csv" ./
ascp -P33001 -i "C:/aspera/cli/etc/asperaweb_id_dsa.openssh" --host=fasp.ebi.ac.uk --user=bsaspera --mode=recv "fire/E-MTAB-/948/E-MTAB-11948/Files/Sample3.csv" ./
ascp -P33001 -i "C:/aspera/cli/etc/asperaweb_id_dsa.openssh" --host=fasp.ebi.ac.uk --user=bsaspera --mode=recv "fire/E-MTAB-/948/E-MTAB-11948/Files/sample1matrix.mtx.gz" ./
ascp -P33001 -i "C:/aspera/cli/etc/asperaweb_id_dsa.openssh" --host=fasp.ebi.ac.uk --user=bsaspera --mode=recv "fire/E-MTAB-/948/E-MTAB-11948/Files/sample3features.tsv.gz" ./
ascp -P33001 -i "C:/aspera/cli/etc/asperaweb_id_dsa.openssh" --host=fasp.ebi.ac.uk --user=bsaspera --mode=recv "fire/E-MTAB-/948/E-MTAB-11948/Files/Sample6.csv" ./
ascp -P33001 -i "C:/aspera/cli/etc/asperaweb_id_dsa.openssh" --host=fasp.ebi.ac.uk --user=bsaspera --mode=recv "fire/E-MTAB-/948/E-MTAB-11948/Files/sample2barcodes.tsv.gz" ./
ascp -P33001 -i "C:/aspera/cli/etc/asperaweb_id_dsa.openssh" --host=fasp.ebi.ac.uk --user=bsaspera --mode=recv "fire/E-MTAB-/948/E-MTAB-11948/Files/sample5features.tsv.gz" ./
ascp -P33001 -i "C:/aspera/cli/etc/asperaweb_id_dsa.openssh" --host=fasp.ebi.ac.uk --user=bsaspera --mode=recv "fire/E-MTAB-/948/E-MTAB-11948/Files/sample3barcodes.tsv.gz" ./
ascp -P33001 -i "C:/aspera/cli/etc/asperaweb_id_dsa.openssh" --host=fasp.ebi.ac.uk --user=bsaspera --mode=recv "fire/E-MTAB-/948/E-MTAB-11948/Files/sample5matrix.mtx.gz" ./
ascp -P33001 -i "C:/aspera/cli/etc/asperaweb_id_dsa.openssh" --host=fasp.ebi.ac.uk --user=bsaspera --mode=recv "fire/E-MTAB-/948/E-MTAB-11948/Files/Sample2.csv" ./
ascp -P33001 -i "C:/aspera/cli/etc/asperaweb_id_dsa.openssh" --host=fasp.ebi.ac.uk --user=bsaspera --mode=recv "fire/E-MTAB-/948/E-MTAB-11948/Files/sample4barcodes.tsv.gz" ./
ascp -P33001 -i "C:/aspera/cli/etc/asperaweb_id_dsa.openssh" --host=fasp.ebi.ac.uk --user=bsaspera --mode=recv "fire/E-MTAB-/948/E-MTAB-11948/Files/sample4matrix.mtx.gz" ./
ascp -P33001 -i "C:/aspera/cli/etc/asperaweb_id_dsa.openssh" --host=fasp.ebi.ac.uk --user=bsaspera --mode=recv "fire/E-MTAB-/948/E-MTAB-11948/Files/Sample1.csv" ./
ascp -P33001 -i "C:/aspera/cli/etc/asperaweb_id_dsa.openssh" --host=fasp.ebi.ac.uk --user=bsaspera --mode=recv "fire/E-MTAB-/948/E-MTAB-11948/Files/sample5barcodes.tsv.gz" ./
ascp -P33001 -i "C:/aspera/cli/etc/asperaweb_id_dsa.openssh" --host=fasp.ebi.ac.uk --user=bsaspera --mode=recv "fire/E-MTAB-/948/E-MTAB-11948/Files/sample1barcodes.tsv.gz" ./
ascp -P33001 -i "C:/aspera/cli/etc/asperaweb_id_dsa.openssh" --host=fasp.ebi.ac.uk --user=bsaspera --mode=recv "fire/E-MTAB-/948/E-MTAB-11948/Files/sample6matrix.mtx.gz" ./
ascp -P33001 -i "C:/aspera/cli/etc/asperaweb_id_dsa.openssh" --host=fasp.ebi.ac.uk --user=bsaspera --mode=recv "fire/E-MTAB-/948/E-MTAB-11948/Files/sample3matrix.mtx.gz" ./
ascp -P33001 -i "C:/aspera/cli/etc/asperaweb_id_dsa.openssh" --host=fasp.ebi.ac.uk --user=bsaspera --mode=recv "fire/E-MTAB-/948/E-MTAB-11948/Files/sample6features.tsv.gz" ./
ascp -P33001 -i "C:/aspera/cli/etc/asperaweb_id_dsa.openssh" --host=fasp.ebi.ac.uk --user=bsaspera --mode=recv "fire/E-MTAB-/948/E-MTAB-11948/Files/sample2features.tsv.gz" ./
ascp -P33001 -i "C:/aspera/cli/etc/asperaweb_id_dsa.openssh" --host=fasp.ebi.ac.uk --user=bsaspera --mode=recv "fire/E-MTAB-/948/E-MTAB-11948/Files/sample2matrix.mtx.gz" ./
ascp -P33001 -i "C:/aspera/cli/etc/asperaweb_id_dsa.openssh" --host=fasp.ebi.ac.uk --user=bsaspera --mode=recv "fire/E-MTAB-/948/E-MTAB-11948/Files/sample6barcodes.tsv.gz" ./
ascp -P33001 -i "C:/aspera/cli/etc/asperaweb_id_dsa.openssh" --host=fasp.ebi.ac.uk --user=bsaspera --mode=recv "fire/E-MTAB-/948/E-MTAB-11948/Files/E-MTAB-11948.idf.txt" ./
ascp -P33001 -i "C:/aspera/cli/etc/asperaweb_id_dsa.openssh" --host=fasp.ebi.ac.uk --user=bsaspera --mode=recv "fire/E-MTAB-/948/E-MTAB-11948/Files/E-MTAB-11948.sdrf.txt" ./
These commands need a small modification, mainly the path to the Aspera key file: C:/aspera/cli/etc/asperaweb_id_dsa.openssh
Path to the key on the server:
# conda environment "rna", which has Aspera installed
conda activate rna
# locate the key file
find /nas2/zhangj/biosoft/miniconda3/envs/rna/ -name '*asperaweb_id_dsa.openssh'
# /nas2/zhangj/biosoft/miniconda3/envs/rna/etc/asperaweb_id_dsa.openssh
# replace the key path in every command of the script
sed -i 's#C:/aspera/cli/etc/asperaweb_id_dsa.openssh#/nas2/zhangj/biosoft/miniconda3/envs/rna/etc/asperaweb_id_dsa.openssh#g' E-MTAB-11948-unix-aspera.sh
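The key-path replacement can also be wrapped in a small function so that the key location is passed in rather than typed into the sed expression. This is only a sketch: `patch_aspera_key` is a hypothetical helper name, and it assumes the new path contains neither `#` nor `&`, since both are special in this sed expression.

```shell
# Sketch: point every ascp command in a downloaded script at a local Aspera key.
# patch_aspera_key is a hypothetical helper, not part of any Aspera tool.
patch_aspera_key() {
    local script="$1"      # e.g. E-MTAB-11948-unix-aspera.sh
    local new_key="$2"     # path reported by find on your own machine
    local old_key='C:/aspera/cli/etc/asperaweb_id_dsa.openssh'

    # Refuse to write a key path that does not exist on this machine.
    [ -f "$new_key" ] || { echo "key not found: $new_key" >&2; return 1; }

    # Write to a temporary file first so the script is never left half-edited.
    sed "s#${old_key}#${new_key}#g" "$script" > "${script}.tmp" &&
        mv "${script}.tmp" "$script" || return 1

    # Print how many lines still mention the old path (0 means all were replaced).
    grep -c -F "$old_key" "$script" || true
}
```

It would be called as `patch_aspera_key E-MTAB-11948-unix-aspera.sh /path/to/asperaweb_id_dsa.openssh`, using whatever path `find` reported.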
Test a single command first. Ha, it fails:
ascp -vQT -l 500m -k 1 -P33001 -i "/nas2/zhangj/biosoft/miniconda3/envs/rna/etc/asperaweb_id_dsa.openssh" --host=fasp-beta.ebi.ac.uk --user=bsaspera --mode=recv "fire/E-MTAB-/948/E-MTAB-11948/Files/Sample4.csv" ./
ascp: failed to authenticate, exiting.
Session Stop (Error: failed to authenticate)
The FTP download link of each file is supposed to be listed in this file:
https://ftp.ebi.ac.uk/biostudies/fire/E-MTAB-/948/E-MTAB-11948/Files/E-MTAB-11948.sdrf.txt
It turns out the FTP addresses inside it are invalid! Ugh!
wget ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTAB/E-MTAB-11948/E-MTAB-11948.processed.9.zip
--2025-02-04 22:31:02-- ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTAB/E-MTAB-11948/E-MTAB-11948.processed.9.zip
=> ‘E-MTAB-11948.processed.9.zip’
Resolving ftp.ebi.ac.uk (ftp.ebi.ac.uk)... 193.62.193.165
Connecting to ftp.ebi.ac.uk (ftp.ebi.ac.uk)|193.62.193.165|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /pub/databases/microarray/data/experiment/MTAB/E-MTAB-11948 ...
No such directory ‘pub/databases/microarray/data/experiment/MTAB/E-MTAB-11948’.
Get the file download addresses from here instead: https://ftp.ebi.ac.uk/biostudies/fire/E-MTAB-/948/E-MTAB-11948/Files/
The address rule is https://ftp.ebi.ac.uk/biostudies/fire/E-MTAB-/948/E-MTAB-11948/Files/ + file name, for example https://ftp.ebi.ac.uk/biostudies/fire/E-MTAB-/948/E-MTAB-11948/Files/sample1barcodes.tsv.gz
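The address rule can be sketched as a small shell function. `biostudies_url` is a hypothetical helper name. The directory layout (accession prefix, then the last three digits, then the full accession) is inferred from the single E-MTAB-11948 example here, so treat it as an assumption and check it against the listing page before using it for other accessions.

```shell
# Sketch: build a download URL from an accession and a file name.
# The layout <prefix>/<last three digits>/<accession>/Files/ is inferred
# from the E-MTAB-11948 example only and may not hold for every accession.
biostudies_url() {
    local acc="$1" file="$2"
    local digits prefix last3
    digits="${acc##*-}"                              # 11948
    prefix="${acc%"$digits"}"                        # E-MTAB-
    last3=$(printf '%s' "$digits" | tail -c 3)       # 948
    printf 'https://ftp.ebi.ac.uk/biostudies/fire/%s/%s/%s/Files/%s\n' \
        "$prefix" "$last3" "$acc" "$file"
}
```

For example, `biostudies_url E-MTAB-11948 sample1barcodes.tsv.gz` prints the sample1barcodes.tsv.gz address given in the rule.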
Get the file names needed to build the download links:
conda activate rna
# (lftp was installed at this step; the commands below only need wget and axel)
# download the SDRF metadata file
wget -c https://ftp.ebi.ac.uk/biostudies/fire/E-MTAB-/948/E-MTAB-11948/Files/E-MTAB-11948.sdrf.txt
# find the columns that hold the sample file names
head -n 1 E-MTAB-11948.sdrf.txt | tr '\t' '\n' | cat -n | grep 'Derived Array Data File'
# 45 Derived Array Data File
# 48 Derived Array Data File
# 51 Derived Array Data File
# 54 Derived Array Data File
# extract those columns, drop the header row, and write one file name per line
cut -f 45,48,51,54 E-MTAB-11948.sdrf.txt | sed '1d' | tr '\t' '\n' > file.txt
The names are saved in file.txt:
sample1barcodes.tsv.gz
sample1features.tsv.gz
sample1matrix.mtx.gz
Sample1.csv
sample2barcodes.tsv.gz
sample2features.tsv.gz
sample2matrix.mtx.gz
Sample2.csv
sample3barcodes.tsv.gz
sample3features.tsv.gz
sample3matrix.mtx.gz
Sample3.csv
sample4barcodes.tsv.gz
sample4features.tsv.gz
sample4matrix.mtx.gz
Sample4.csv
sample5barcodes.tsv.gz
sample5features.tsv.gz
sample5matrix.mtx.gz
Sample5.csv
sample6barcodes.tsv.gz
sample6features.tsv.gz
sample6matrix.mtx.gz
Sample6.csv
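The column numbers 45, 48, 51, and 54 are specific to this SDRF file. A variant that looks the columns up by header name might look like the sketch below. `sdrf_derived_files` is a hypothetical helper name. Unlike the cut command, it also skips empty cells and repeated names, and it assumes the header text is exactly `Derived Array Data File` with Unix line endings.

```shell
# Sketch: print every value found in columns whose header is exactly
# "Derived Array Data File", without hard-coding column numbers.
# sdrf_derived_files is a hypothetical helper name.
sdrf_derived_files() {
    awk -F'\t' '
        NR == 1 {
            # Remember the index of every matching header column.
            for (i = 1; i <= NF; i++)
                if ($i == "Derived Array Data File") cols[++n] = i
            next
        }
        {
            # Print non-empty values that have not been printed before.
            for (j = 1; j <= n; j++) {
                v = $(cols[j])
                if (v != "" && !(v in seen)) { seen[v] = 1; print v }
            }
        }
    ' "$1"
}
```

Usage would be `sdrf_derived_files E-MTAB-11948.sdrf.txt > file.txt`.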
Download:
# write one axel command per file into down.sh, then run the script
while read -r id
do
echo "axel -n 100 https://ftp.ebi.ac.uk/biostudies/fire/E-MTAB-/948/E-MTAB-11948/Files/$id"
done < file.txt > down.sh
bash down.sh
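After the downloads finish, it may be worth confirming that nothing is missing or truncated. The sketch below checks each name in file.txt and runs gzip's built-in integrity test on the .gz files. `check_downloads` is a hypothetical helper name, and this check is an addition of mine, not a step from the original workflow.

```shell
# Sketch: confirm that every file named in a list exists and is non-empty,
# and that .gz files pass gzip -t. check_downloads is a hypothetical helper.
check_downloads() {
    local list="$1" dir="${2:-.}" bad=0 f
    while IFS= read -r f; do
        [ -n "$f" ] || continue                      # skip blank lines
        if [ ! -s "$dir/$f" ]; then
            echo "MISSING  $f"; bad=$((bad + 1))
        elif [ "${f##*.}" = "gz" ] && ! gzip -t "$dir/$f" 2>/dev/null; then
            echo "CORRUPT  $f"; bad=$((bad + 1))
        fi
    done < "$list"
    [ "$bad" -eq 0 ]      # exit status is 0 only if no problem was found
}
```

Running `check_downloads file.txt` in the download directory prints one line per problem file and returns a non-zero status if any were found.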