一共三部分 通过简单数据熟悉Linux下生物信息学各种操作1 通过简单数据熟悉Linux下生物信息学各种操作2 通过简单数据熟悉Linux下生物信息学各种操作3
cat src
curl -OL http://downloads.sourceforge.net/project/bio-bwa/bwa-0.7.17.tar.bz2
tar jxvf bwa-0.7.17.tar.bz2
cd bwa-0.7.17/
make
ln -s ~/src/bwa-0.7.17/bwa ~/bin
创建文件夹放references 获取1999爆发ebola的参考基因组
mkdir -p ~/refs/852
efetch -db nucleotide -id NC_002549 -format fasta > ~/refs/852/ebola-1999.fa
bwa index ~/refs/852/ebola-1999.fa
ls ~/refs/852
ebola-1999.fa ebola-1999.fa.ann ebola-1999.fa.pac
ebola-1999.fa.amb ebola-1999.fa.bwt ebola-1999.fa.sa
通过blast也可以搜寻到
makeblastdb -in ~/refs/852/ebola-1999.fa -dbtype nucl -parse_seqids
ls ~/refs/852
ebola-1999.fa ebola-1999.fa.nhr ebola-1999.fa.nsi
ebola-1999.fa.amb ebola-1999.fa.nin ebola-1999.fa.nsq
ebola-1999.fa.ann ebola-1999.fa.nog ebola-1999.fa.pac
ebola-1999.fa.bwt ebola-1999.fa.nsd ebola-1999.fa.sa
共有16个文件,这也是为什么刚才创建单独文件夹的原因 获取ebolas 基因组的前1行作为query序列
head -2 ~/refs/852/ebola-1999.fa > query.fa
现在用bwa-mem把上面的query map到它自己的基因组上去
cat results.sam
@SQ SN:NC_002549.1 LN:18959
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem /Users/ucco/refs/852/ebola-1999.fa query.fa
NC_002549.1 0 NC_002549.1 1 60 70M * 0 0 CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA * NM:i:0 MD:Z:70 AS:i:70 XS:i:0
比较用上面同样序列的blasting结果
blastn -db ~/refs/852/ebola-1999.fa -query query.fa > results.blastn
Query= NC_002549.1 Zaire ebolavirus isolate Ebola
virus/H.sapiens-tc/COD/1976/Yambuku-Mayinga, complete genome
Length=70
Score E
Sequences producing significant alignments: (Bits) Value
NC_002549.1 Zaire ebolavirus isolate Ebola virus/H.sapiens-tc/COD... 130 6e-34
>NC_002549.1 Zaire ebolavirus isolate Ebola virus/H.sapiens-tc/COD/1976/Yambuku-Mayinga,
complete genome
Length=18959
Score = 130 bits (70), Expect = 6e-34
Identities = 70/70 (100%), Gaps = 0/70 (0%)
Strand=Plus/Plus
Query 1 CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGA 60
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 1 CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGA 60
Query 61 AGATTAATAA 70
||||||||||
Sbjct 61 AGATTAATAA 70
Lambda K H
1.33 0.621 1.12
Gapped
Lambda K H
1.28 0.460 0.850
Effective search space used: 1079922
Database: /Users/ucco/refs/852/ebola-1999.fa
现在vi query.fa,增加或删除几个bases,然后进行比对 比如在第6个碱基开始添加一串T
cat query.fa
>NC_002549.1 Zaire ebolavirus isolate Ebola virus/H.sapiens-tc/COD/1976/Yambuku-Mayinga, complete genome
CGGACTTTTTTTTTACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA
再次运行比对程序,结果如下
@SQ SN:NC_002549.1 LN:18959
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem /Users/ucco/refs/852/ebola-1999.fa query.fa
NC_002549.1 0 NC_002549.1 6 60 14S65M * 0 CGGACTTTTTTTTTACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAANM:i:0 MD:Z:65 AS:i:65 XS:i:0
#之前没有修改bases的序列比对结果
@SQ SN:NC_002549.1 LN:18959
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem /Users/ucco/refs/852/ebola-1999.fa query.fa
NC_002549.1 0 NC_002549.1 1 60 70M * 0 0 CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA * NM:i:0 MD:Z:70 AS:i:70 XS:i:0
为了查看特定列,可以进行cut操作,要深入理解每列的意义 query id,alignment flag and target id
cat results.sam |cut -f 1,2,3
@SQ SN:NC_002549.1 LN:18959
@PG ID:bwa PN:bwa
NC_002549.1 0 NC_002549.1
比对的start,mapping quality,CIGAR string
cat results.sam |cut -f 4,5,6
VN:0.7.17-r1188 CL:bwa mem /Users/ucco/refs/852/ebola-1999.fa query.fa
1 60 70M
paired end read information,name ,position and length of template
cat results.sam |cut -f 7,8,9
* 0 0
query sequence, query quality, other attributes
cat results.sam |cut -f 10-14
CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA * NM:i:0 MD:Z:70 AS:i:70
在这里下载示例数据
curl -O http://www.personal.psu.edu/iua1/courses/files/2014/data-14.tar.gz
tar xzvf data-14.tar.gz
会产生两个名字为read1.fq和read2.fq的文件 用bwa比对
bwa mem ~/refs/852/ebola-1999.fa read1.fq read2.fq > results.sam
部分结果如下
@SQ SN:NC_002549.1 LN:18959
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem /Users/ucco/refs/852/ebola-1999.fa read1.fq read2.fq
gi|10313991|ref|NC_002549.1|_17166_17748_1:1:0_0:0:0_0 83 NC_002549.1 17679 60 70M = 17166 -583 AATTCAACGAAGCCCATACTGGCTAAGTCATTTAACTCAGTATGCTGACTGTGAGTTACATTTAAGTTAT 2222222222222222222222222222222222222222222222222222222222222222222222 NM:i:0 MD:Z:70 MC:Z:70M AS:i:70 XS:i:0
gi|10313991|ref|NC_002549.1|_17166_17748_1:1:0_0:0:0_0 163 NC_002549.1 17166 60 70M = 17679 583 CAGTCTAGAGTCAGAAATAGTATCAGGAATGACTACTCCTAGGATGCTTCTACCCGTTATGTCAAAATTC 2222222222222222222222222222222222222222222222222222222222222222222222 NM:i:2 MD:Z:4A49T15 MC:Z:70M AS:i:6XS:i:0
gi|10313991|ref|NC_002549.1|_4818_5334_1:0:0_3:0:0_1 99 NC_002549.1 4818 60 70M = 5265 517 GCCATCATGCTTGCTTCATACACTATCACCCAATTCGGCAAGGCAACCAATCCACTTGTCAGAGTCAATC 2222222222222222222222222222222222222222222222222222222222222222222222 NM:i:1 MD:Z:32T37 MC:Z:70M AS:i:6XS:i:0
gi|10313991|ref|NC_002549.1|_4818_5334_1:0:0_3:0:0_1 147 NC_002549.1 5265 60 70M = 4818 -517 GTGCCAGAAACTCTGGTCCTCAAGCTGACCGGTAAGAAGGTGACTTCTAAAATTCGACAACCAATCATCC 2222222222222222222222222222222222222222222222222222222222222222222222 NM:i:3 MD:Z:19A32A1G15 MC:Z:70M AS:i:5XS:i:0
gi|10313991|ref|NC_002549.1|_8927_9349_3:0:0_3:0:0_2 83 NC_002549.1 9280 60 70M = 8927 -423 GGTTCAGGGTTAAGAACATTGGTTCCTCAATCAGATAATGAGGAAGCTTCAACCAAACCGGGGAAATGCT 2222222222222222222222222222222222222222222222222222222222222222222222 NM:i:3 MD:Z:1T54C7C5 MC:Z:70M AS:i:5XS:i:0
为了简单明了表示,把上述的qurey加T后一起运行
cp query.fa query_addT.fa
vi query_addT.fa
bwa mem ~/refs/852/ebola-1999.fa query.fa query_addT.fa> xresults.sam
结果如下
@SQ SN:NC_002549.1 LN:18959
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem /Users/ucco/refs/852/ebola-1999.fa query.fa query_addT.fa
NC_002549.1 65 NC_002549.1 1 60 70M = 6 6 CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA * NM:i:0 MD:Z:70 MC:Z:15S65M AS:i:70 XS:i:0
NC_002549.1 129 NC_002549.1 6 60 15S65M = 1 -6 CGGACTTTTTTTTTTACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA * NM:i:0 MD:Z:65 MC:Z:70M AS:i:65 XS:i:0
这部分暂时隔过
mkdir -p readseq
cd ~/src/readseq
curl -OL http://iubio.bio.indiana.edu/soft/molbio/readseq/java/readseq.jar
我打不开这个网站,然后google下载的. 因为readseq调用的方式和前面安装的trimmomatic相似,所以cp
cp ~/bin/trimmomatic ~/bin/readseq
vi ~/bin/readseq
改成下图即可
vi_readseq
http://www.ncbi.nlm.nih.gov/genome/4887 进入lec文件夹,随便你以前用其他名字命名的 获取complete data
efetch -db nucleotide -id NC_002549.1 -format gb > NC.gb
格式为GFF(Generic Feature Format)文件
readseq -format=GFF NC.gb
# 也可以设定输出名字
readseq -format=GFF -o NC-all.gff NC.gb
会有以下几个文件
4files
提取到fasta 文件
readseq -format=FASTA -o NC.fa NC.gb
先查看一下gff文件
cat NC-all.gff |head
##gff-version 2
# seqname source feature start end score strand frame attributes
NC_002549 - source 1 18959 . + . organism "Zaire ebolavirus" ; mol_type "viral cRNA" ; isolate "Ebola virus/H.sapiens-tc/COD/1976/Yambuku-Mayinga" ; db_xref "taxon:186538"
NC_002549 - 5'UTR 1 55 . + . note "leader region" ; citation "[1]" ; function "regulation or initiation of RNA replication"
NC_002549 - gene 56 3026 . + . gene "NP" ; locus_tag "ZEBOVgp1" ; db_xref "GeneID:911830"
NC_002549 - mRNA 56 3026 . + . gene "NP" ; locus_tag "ZEBOVgp1" ; product "nucleoprotein" ; db_xref "GeneID:911830"
NC_002549 - regulatory 56 67 . + . regulatory_class "other" ; gene "NP" ; locus_tag "ZEBOVgp1" ; note "putative; transcription start signal" ; citation "[1]"
NC_002549 - CDS 470 2689 . + . gene "NP" ; locus_tag "ZEBOVgp1" ; function "encapsidation of genomic RNA" ; codon_start 1 ; product "nucleoprotein" ; protein_id "NP_066243.1" ; db_xref "GeneID:911830" ; translation "MDSRPQKIWMAPSLTESDMDYHKILTAGLSVQQGIVRQRVIPVYQVNNLEEICQLIIQAFEAGVDFQESADSFLLMLCLHHAYQGDYKLFLESGAVKYLEGHGFRFEVKKRDGVKRLEELLPAVSSGKNIKRTLAAMPEEETTEANAGQFLSFASLFLPKLVVGEKACLEKVQRQIQVHAEQGLIQYPTAWQSVGHMMVIFRLMRTNFLIKFLLIHQGMHMVAGHDANDAVISNSVAQARFSGLLIVKTVLDHILQKTERGVRLHPLARTAKVKNEVNSFKAALSSLAKHGEYAPFARLLNLSGVNNLEHGLFPQLSAIALGVATAHGSTLAGVNVGEQYQQLREAATEAEKQLQQYAESRELDHLGLDDQEKKILMNFHQKKNEISFQQTNAMVTLRKERLAKLTEAITAASLPKTSGHYDDDDDIPFPGPINDDDNPGHQDDDPTDSQDTTIPDVVVDPDDGSYGEYQSYSENGMNAPDDLVLFDLDEDDEDTKPV**RSTKGGQQKNSQKGQHIEGRQTQSRPIQNVPGPHRTIHHASAPLTDNDRRNEPSGSTSPRMLTPINEEADPLDDADDETSSLPPLESDDEEQDRDGTSNRTPTVAPPAPVYRDHSEKKELPQDEQQDQDHTQEARNQDSDNTQSEHSFEEMYRHILRSQGPFDAVLYYHMMKDEPVVFSTSDGKEYTYPDSLEEEYPPWLTEKEAMNEENRFVTLDGQQFYWPVMNHKNKFMAILQHHQ"
NC_002549 - regulatory 3015 3026 . + . regulatory_class "polyA_signal_sequence" ; gene "NP" ; locus_tag "ZEBOVgp1"
NC_002549 - misc_feature 3027 3031 . + . note "intergenic region"
只获取gff文件中的gene rows
cat NC-all.gff |egrep '\tgene\t'
NC_002549 - gene 56 3026 . + . gene "NP" ; locus_tag "ZEBOVgp1" ; db_xref "GeneID:911830"
NC_002549 - gene 3032 4407 . + . gene "VP35" ; locus_tag "ZEBOVgp2" ; db_xref "GeneID:911827"
NC_002549 - gene 4390 5894 . + . gene "VP40" ; locus_tag "ZEBOVgp3" ; db_xref "GeneID:911825"
NC_002549 - gene 5900 8305 . + . gene "GP" ; locus_tag "ZEBOVgp4" ; db_xref "GeneID:911829"
NC_002549 - gene 8288 9740 . + . gene "VP30" ; locus_tag "ZEBOVgp5" ; db_xref "GeneID:911826"
NC_002549 - gene 9885 11518 . + . gene "VP24" ; locus_tag "ZEBOVgp6" ; note "putative" ; db_xref "GeneID:911828"
NC_002549 - gene 11501 18282 . + . gene "L" ; locus_tag "ZEBOVgp7" ; db_xref "GeneID:911824"
实际上我们还想要前两行的注释行
cat NC-all.gff |head -2 > NC-genes.gff
cat NC-all.gff | egrep '\tgene\t' >> NC-genes.gff
cat NC-genes.gff
##gff-version 2
# seqname source feature start end score strand frame attributes
NC_002549 - gene 56 3026 . + . gene "NP" ; locus_tag "ZEBOVgp1" ; db_xref "GeneID:911830"
NC_002549 - gene 3032 4407 . + . gene "VP35" ; locus_tag "ZEBOVgp2" ; db_xref "GeneID:911827"
NC_002549 - gene 4390 5894 . + . gene "VP40" ; locus_tag "ZEBOVgp3" ; db_xref "GeneID:911825"
NC_002549 - gene 5900 8305 . + . gene "GP" ; locus_tag "ZEBOVgp4" ; db_xref "GeneID:911829"
NC_002549 - gene 8288 9740 . + . gene "VP30" ; locus_tag "ZEBOVgp5" ; db_xref "GeneID:911826"
NC_002549 - gene 9885 11518 . + . gene "VP24" ; locus_tag "ZEBOVgp6" ; note "putative" ; db_xref "GeneID:911828"
NC_002549 - gene 11501 18282 . + . gene "L" ; locus_tag "ZEBOVgp7" ; db_xref "GeneID:911824"
NC_002549 - gene 56 3026 . + . gene "NP" ; locus_tag "ZEBOVgp1" ; db_xref "GeneID:911830"
NC_002549 - gene 3032 4407 . + . gene "VP35" ; locus_tag "ZEBOVgp2" ; db_xref "GeneID:911827"
NC_002549 - gene 4390 5894 . + . gene "VP40" ; locus_tag "ZEBOVgp3" ; db_xref "GeneID:911825"
NC_002549 - gene 5900 8305 . + . gene "GP" ; locus_tag "ZEBOVgp4" ; db_xref "GeneID:911829"
NC_002549 - gene 8288 9740 . + . gene "VP30" ; locus_tag "ZEBOVgp5" ; db_xref "GeneID:911826"
NC_002549 - gene 9885 11518 . + . gene "VP24" ; locus_tag "ZEBOVgp6" ; note "putative" ; db_xref "GeneID:911828"
NC_002549 - gene 11501 18282 . + . gene "L" ; locus_tag "ZEBOVgp7" ; db_xref "GeneID:911824"
染色体坐标不匹配,需要修复 重新索引并且重新比对
cp NC.fa ~/refs/852/
bwa index ~/refs/852/NC.fa
[bwa_index] Pack FASTA... 0.00 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.01 seconds elapse.
[bwa_index] Update BWT... 0.00 sec
[bwa_index] Pack forward-only FASTA... 0.00 sec
[bwa_index] Construct SA from BWT and Occ... 0.00 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index /Users/ucco/refs/852/NC.fa
[main] Real time: 0.113 sec; CPU: 0.015 sec
再align 注意我的命名有点乱,最好去完全按原文中的命名 下面的align.sh我没找到
cp ../lec4/*.fq .
bash align.sh read1.fq read2.fq results.bam
载入IGV,看100-150bp区域的深度