Novel functional sequences uncovered through a bovine multiassembly graph
https://www.pnas.org/doi/10.1073/pnas.2101056118
Code for the paper:
https://github.com/AnimalGenomicsETH/bovine-graphs
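The paper runs a set of pangenome analyses on six bovine genomes, using minigraph.
The first step builds a phylogenetic tree from the six genomes. The tree is built with the neighbor-joining (NJ) method from pairwise genetic distances, and those distances are computed directly from the genome sequences with mash.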
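For orientation, the distance mash reports is derived from an estimate j of the k-mer Jaccard index between two sketches, as D = -(1/k) ln(2j / (1 + j)). Below is a minimal sketch of that formula in awk. The function name mash_distance is made up for illustration and is not part of mash; k = 21 is mash's default k-mer size, which the commands in this post do not override.

```shell
# Hypothetical helper (not part of mash): Mash distance from a Jaccard estimate.
# Usage: mash_distance JACCARD [K]   (K defaults to 21, mash's default k-mer size)
mash_distance() {
    awk -v j="$1" -v k="${2:-21}" 'BEGIN {
        # The formula is undefined for j = 0; assumption: treat that as the maximal distance 1
        if (j <= 0) { print 1; exit }
        # log((1 + j) / (2 * j)) / k is the same as -(1/k) * ln(2j / (1 + j))
        printf "%.6f\n", log((1 + j) / (2 * j)) / k
    }'
}

mash_distance 0.5
```

Identical sketches (j = 1) give a distance of 0, and the distance grows as fewer k-mers are shared.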
mash can be installed directly with conda (it is distributed through the bioconda channel):
conda install -c bioconda mash
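The paper provides the code for this part of the analysis, so here I reproduce it, using six Arabidopsis thaliana genomes as the data.
The mash step in the paper's repository is at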
https://github.com/AnimalGenomicsETH/bovine-graphs/blob/main/subworkflows/mash_distance.py
mash sketch -p 2 -o An1.fa.msh ../An-1.chr.all.v2.0.fasta
mash sketch -p 2 -o C24.fa.msh ../C24.chr.all.v2.0.fasta
mash sketch -p 2 -o Cvi.fa.msh ../Cvi.chr.all.v2.0.fasta
mash sketch -p 2 -o Kyo.fa.msh ../Kyo.chr.all.v2.0.fasta
mash sketch -p 2 -o Ler.fa.msh ../Ler.chr.all.v2.0.fasta
mash sketch -p 2 -o Sha.fa.msh ../Sha.chr.all.v2.0.fasta
mash paste combined_sketch.msh An1.fa.msh C24.fa.msh Cvi.fa.msh Kyo.fa.msh Ler.fa.msh Sha.fa.msh
mash dist combined_sketch.msh combined_sketch.msh > combined_distance.tsv
This step runs very quickly.
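The six mash sketch commands above differ only in the file name, so they can be collapsed into a loop. This is a sketch under the assumption that every genome file is named like the ones above (accession, then .chr.all.v2.0.fasta); the function names sketch_name and sketch_all are my own and appear neither in mash nor in the paper's repository.

```shell
# Derive the sketch name used in this post from a FASTA path,
# e.g. ../An-1.chr.all.v2.0.fasta -> An1.fa.msh
sketch_name() {
    base=${1##*/}                                      # drop the directory part
    acc=$(printf '%s' "${base%%.*}" | sed 's/-//g')    # text before the first dot, hyphens removed
    printf '%s.fa.msh\n' "$acc"
}

# Run "mash sketch" once per FASTA file given as an argument.
sketch_all() {
    for fa in "$@"; do
        mash sketch -p 2 -o "$(sketch_name "$fa")" "$fa" || return 1
    done
}
```

Calling sketch_all ../*.chr.all.v2.0.fasta would then replace the six sketch lines; the mash paste and mash dist commands stay as they are.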
The R script the paper uses to build and draw the tree is at
https://github.com/AnimalGenomicsETH/bovine-graphs/blob/main/scripts/phylo_tree_assembly.R
library(tidyverse)
library(ape)
library(ggtree)
# mash dist wrote its output to combined_distance.tsv in the step above
disfile <- "combined_distance.tsv"
datdis <- read.table(disfile,header=FALSE, stringsAsFactors =FALSE)
datdis
# mash dist columns: reference ID, query ID, distance, p-value, shared hashes
colnames(datdis) <- c("anim1","anim2","distr","comp4","comp5")
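# Shorten each ID to its first alphanumeric run, e.g. "../An-1.chr.all.v2.0.fasta" becomes "An"
datdis %>%
  mutate(anim1c = str_extract(anim1, pattern = "[A-Za-z0-9]+"),
         anim2c = str_extract(anim2, pattern = "[A-Za-z0-9]+")) -> datdis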
datsel <- datdis %>% select(anim1c,anim2c, distr)
datwide <- datsel %>% pivot_wider(names_from = anim2c, values_from = distr)
datmat <- as.matrix(datwide %>% select(-anim1c))
rownames(datmat) <- datwide$anim1c
datmat
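# Build the neighbor-joining tree and root it on the An-1 accession (its ID is shortened to "An")
tr <- nj(datmat)
new.tr <- root(tr, outgroup = "An")
# xlim leaves room on the right for the tip labels
ggtree(new.tr) +
  geom_tiplab() +
  xlim(NA, 0.01)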
The code above is slightly modified from the version in the paper.
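The pivot_wider step in the R code turns the long three-column table into the square matrix that nj() expects. To preview that matrix without opening R, the same reshaping can be sketched in awk. The function name pivot_mash is my own, and the sketch assumes the tab-separated five-column layout that mash dist writes; IDs are printed unshortened, in order of first appearance.

```shell
# Hypothetical helper: pivot "mash dist" output (ref, query, distance, ...)
# into a square, tab-separated distance matrix.
pivot_mash() {
    awk -F'\t' '
        # remember each reference ID the first time it appears
        !($1 in seen) { seen[$1] = 1; ids[++n] = $1 }
        # store the distance for the (reference, query) pair
        { d[$1, $2] = $3 }
        END {
            line = "id"
            for (j = 1; j <= n; j++) line = line "\t" ids[j]
            print line
            for (i = 1; i <= n; i++) {
                line = ids[i]
                for (j = 1; j <= n; j++) line = line "\t" d[ids[i], ids[j]]
                print line
            }
        }
    ' "$1"
}
```

Running pivot_mash combined_distance.tsv should print a 6 x 6 matrix with zeros on the diagonal; a missing pair shows up as an empty cell.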