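【4.1.2】ANNOVAR: variant annotation

Once variant sites are known, we need to annotate their functional consequences. ANNOVAR annotates these variants and produces an easy-to-read list of variant sites.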

1. Introduction

Official site: http://annovar.openbioinformatics.org/en/latest/
Supported genomes: human genome hg18, hg19, hg38, as well as mouse, worm, fly, yeast and many others.
Annotation types:

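  • Gene-based: determines whether a SNP or CNV changes the protein-coding sequence or the encoded amino acid
  • Region-based: determines whether a variant falls within specific genomic regions
  • Filter-based: reports attributes of the variant recorded in specific databases
  • Other functions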

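Software download: register through the download form linked from the official site (an institutional email address is required). The M-CAP scores used in the mcap database come from http://bejerano.stanford.edu/MCAP/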

2. Downloading databases

  • Latest database list: http://annovar.openbioinformatics.org/en/latest/

  • Database download instructions: http://annovar.openbioinformatics.org/en/latest/user-guide/download/

Method 1: via annotate_variation.pl

perl annotate_variation.pl -downdb -buildver hg19 -webfrom annovar mcap humandb/

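Database downloaded from mcap (mcap_v1_0.txt), reformatted and indexed. Note that the awk command keeps only the chromosome and the position (written twice, as start and end):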
awk '{if(NR!=1) print $1 "\t" $2 "\t" $2}' mcap_v1_0.txt >mcap_v1_0_change.txt
perl Annovar_index.pl mcap_v1_0_change.txt 500
perl annotate_variation.pl -downdb -buildver hg19 -webfrom annovar dbscsnv11 humandb/

Method 2: via aria2c

aria2c http://www.openbioinformatics.org/annovar/download/hg19_dbnsfp33a.txt.idx.gz
aria2c http://www.openbioinformatics.org/annovar/download/hg19_dbnsfp33a.txt.gz
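
The two aria2c commands above follow a regular naming pattern, so the URL pair for another database can be generated rather than typed by hand. A minimal sketch, assuming the same pattern holds for other build and database names (this is inferred from the two URLs above, not a documented guarantee); the function name annovar_db_urls is made up for illustration:

```shell
#!/bin/sh
# Print the two download URLs (data file and its index) for one database.
# Assumption: every database follows the <build>_<db>.txt[.idx].gz pattern
# seen in the two aria2c commands above.
annovar_db_urls() {
    build="$1"   # genome build, e.g. hg19
    db="$2"      # database keyword, e.g. dbnsfp33a
    base="http://www.openbioinformatics.org/annovar/download"
    printf '%s/%s_%s.txt.gz\n' "$base" "$build" "$db"
    printf '%s/%s_%s.txt.idx.gz\n' "$base" "$build" "$db"
}

# Usage (not run here): pass the list to aria2c on stdin, then unpack.
# annovar_db_urls hg19 dbnsfp33a | aria2c -d humandb -i -
# gunzip humandb/hg19_dbnsfp33a.txt.gz humandb/hg19_dbnsfp33a.txt.idx.gz
```

Both files are needed: the .idx file is the index that goes with the .txt data file.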

Method 3: build the database yourself and create an index

revel_all_chromosomes.csv, obtained from JF:

awk -F "," '{if(NR==1){print "#" $1 "\t" $2 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" $6 "\t" $7 } else {print $1 "\t" $2 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" $6 "\t" $7 } }' revel_all_chromosomes.csv >hg19_revel.txt
perl Annovar_index.pl hg19_revel.txt 1000
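
To see what the conversion above produces, it helps to run it on a tiny file first. The sketch below applies the same column mapping (position duplicated as start and end, "#" prefixed to the header) to a two-line sample. The sample header names and values are invented for illustration, and the function name revel_csv_to_annovar is hypothetical:

```shell
#!/bin/sh
# Convert a REVEL-style CSV (chr,pos,ref,alt,aaref,aaalt,score) into a
# tab-separated table with Chr, Start, End, Ref, Alt and the remaining columns.
revel_csv_to_annovar() {
    awk -F ',' 'BEGIN { OFS = "\t" }
        {
            prefix = (NR == 1) ? "#" : ""   # mark the header line as a comment
            print prefix $1, $2, $2, $3, $4, $5, $6, $7
        }' "$1"
}

# Demo on made-up data (not real REVEL scores).
sample="$(mktemp)"
printf 'chr,hg19_pos,ref,alt,aaref,aaalt,REVEL\n1,35142,G,A,T,M,0.027\n' > "$sample"
revel_csv_to_annovar "$sample"
rm -f "$sample"
```

Setting OFS once keeps the command shorter than repeating "\t" between every field, while producing the same layout.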

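After several downloads, I prefer Method 2 for larger databases, because you can see how much has been downloaded and at what speed.

3. Annotation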

/bioinfo/software/bin/table_annovar.pl -buildver hg19 -protocol refGene,cytoBand,mcap -operation g,r,f -nastring . -remove -otherinfo --vcfinput S1570169.g.vcf /home/qqin/download/annovar/humandb

Notes:

The -protocol argument lists the databases to use, and the -operation argument tells ANNOVAR which operation to use for each of the protocols, so the two comma-separated lists must have the same length:
g means gene-based,
r means region-based,
f means filter-based.
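
Because each protocol needs exactly one operation, a mismatch between the two lists is an easy mistake to make. The following sketch checks the lengths before printing a command line. The helper names (count_items, build_table_annovar_cmd) are hypothetical, and the protocol names in the usage line are only an example that assumes those databases exist in humandb/:

```shell
#!/bin/sh
# Count comma-separated items in a list such as "refGene,cytoBand,mcap".
count_items() {
    printf '%s\n' "$1" | awk -F ',' '{ print NF }'
}

# Print a table_annovar.pl command line, or fail if the lists differ in length.
build_table_annovar_cmd() {
    input="$1"; dbdir="$2"; protocols="$3"; operations="$4"
    if [ "$(count_items "$protocols")" -ne "$(count_items "$operations")" ]; then
        echo "protocol and operation lists must have the same length" >&2
        return 1
    fi
    printf 'table_annovar.pl %s %s -buildver hg19 -protocol %s -operation %s -nastring . -remove -vcfinput\n' \
        "$input" "$dbdir" "$protocols" "$operations"
}

# Usage: prints the command instead of running it, so it can be reviewed first.
build_table_annovar_cmd S1570169.g.vcf humandb/ refGene,cytoBand,mcap g,r,f
```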

3.1 Gene-based annotation

-operation is g

qqin@lizard:[program]$ head /bioinfo/software/packages/annovar-2015.06.17/humandb/hg19_refGene.txt
19 NM_001291929 chr11 - 89057521 89223909 89059923 89223852 17 89057521,89069012,89070614,89073230,89075241,89088129,89106599,89133184,89133382,89135493,89155069,89165951,89173855,89177302,89182607,89184952,89223774, 89060044,89069113,89070683,89073339,89075361,89088211,89106660,89133247,89133547,89135710,89155150,89166024,89173883,89177400,89182692,89185063,89223909, 0 NOX4 cmpl cmpl 2,0,0,2,2,1,0,0,0,2,2,1,0,1,0,0,0,
985 NM_016039 chr14 + 52456227 52471420 52456357 52471234 8 52456227,52458034,52460440,52465211,52466425,52468515,52470911,52471079, 524
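
The refGene records above are hard to read at a glance. As a rough aid, the sketch below prints only the gene symbol, transcript ID, chromosome, strand and exon count from each complete record. It assumes the column order seen in the NOX4 record above (gene symbol in column 13, exon count in column 9); the function name refgene_summary is hypothetical:

```shell
#!/bin/sh
# Summarise refGene records: gene symbol, transcript, chromosome, strand, exons.
# Lines with fewer than 13 fields (for example truncated ones) are skipped.
refgene_summary() {
    awk 'BEGIN { OFS = "\t" } NF >= 13 { print $13, $2, $3, $4, $9 }' "$1"
}

# Usage (path is an example, adjust to your own humandb directory):
# refgene_summary humandb/hg19_refGene.txt | head
```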

3.2 Region-based annotation

-operation is r

qqin@lizard:[program]$ head /bioinfo/software/packages/annovar-2015.06.17/humandb/hg19_cytoBand.txt
chr1 0 2300000 p36.33 gneg
chr1 2300000 5400000 p36.32 gpos25
chr1 5400000 7200000 p36.31 gneg
chr1 7200000 9200000 p36.23 gpos25
chr1 9200000 12700000 p36.22 gneg
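
Region-based annotation boils down to asking which interval a position falls into. The sketch below does that lookup against the five cytoBand lines shown above. It assumes the start column is 0-based and the end column is inclusive for 1-based positions (the usual UCSC table convention); the function name cytoband_lookup is hypothetical, and this is only an illustration of the idea, not how ANNOVAR implements it:

```shell
#!/bin/sh
# Print the band containing a 1-based position.
# File columns: chrom, start (0-based), end, band name, stain.
cytoband_lookup() {
    awk -v chr="$2" -v pos="$3" \
        '$1 == chr && pos > $2 && pos <= $3 { print $4; exit }' "$1"
}

# Demo using the cytoBand lines shown above.
bands="$(mktemp)"
cat > "$bands" <<'EOF'
chr1 0 2300000 p36.33 gneg
chr1 2300000 5400000 p36.32 gpos25
chr1 5400000 7200000 p36.31 gneg
chr1 7200000 9200000 p36.23 gpos25
chr1 9200000 12700000 p36.22 gneg
EOF
cytoband_lookup "$bands" chr1 2300001   # prints p36.32
rm -f "$bands"
```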

3.3 Filter-based annotation

-operation is f

qqin@lizard:[program]$ head /bioinfo/software/packages/annovar-2015.06.17/humandb/hg19_ljb26_all.txt
#Chr Start End Ref Alt SIFT_score SIFT_pred Polyphen2_HDIV_score Polyphen2_HDIV_pred Polyphen2_HVAR_score Polyphen2_HVAR_pred LRT_score LRT_pred MutationTaster_score MutationTaster_pred MutationAssessor_score MutationAssessor_pred FATHMM_score FATHMM_pred RadialSVM_score RadialSVM_pred LR_score LR_pred VEST3_score CADD_raw CADD_phred GERP++_RS phyloP46way_placental phyloP100way_vertebrate SiPhy_29way_logOdds
1 35138 35138 T A . . . . . . . . 1.000 N . . . . . . . . . -0.886 0.467 0.742 0.593 0.339 3.824
1 35138 35138 T G . . . . . . . . 1.000 N . . . . . . . . . -0.996 0.267 0.742 0.593 0.339 3.824
1 35139 35139 T A . . . . . . . . 1.000 N

LJB* (dbNSFP) non-synonymous variants annotation

This database includes SIFT scores, PolyPhen2 HDIV scores, PolyPhen2 HVAR scores, LRT scores, MutationTaster scores, MutationAssessor scores, FATHMM scores, GERP++ scores, PhyloP scores and SiPhy scores.

To make future updates easier, this database has now been renamed dbnsfp30a:

annotate_variation.pl -downdb -webfrom annovar -buildver hg19 dbnsfp30a humandb/ 
table_annovar.pl ex1.avinput humandb/ -protocol dbnsfp30a -operation f -build hg19 -nastring .

To annotate with only one of these scores, you can download that database on its own. The keywords used for downloading these data include: ljb23_sift, ljb23_pp2hdiv, ljb23_pp2hvar, ljb23_lrt, ljb23_mt, ljb23_ma, ljb23_fathmm, ljb23_metasvm, ljb23_metalr, ljb23_gerp++, ljb23_phylop, ljb23_siphy, ljb23_all. The ljb23_all database includes ALL scores, and it is very useful in table_annovar.pl.

Explanation of the LJB23 annotation results

Score (dbtype)                # variants in LJB23 build hg19    Categorical prediction
SIFT (sift)                   77593284                          D: Deleterious (sift<=0.05); T: Tolerated (sift>0.05)
PolyPhen 2 HDIV (pp2_hdiv)    72533732                          D: Probably damaging (>=0.957); P: Possibly damaging (0.453<=pp2_hdiv<=0.956); B: Benign (pp2_hdiv<=0.452)
PolyPhen 2 HVar (pp2_hvar)    72533732                          D: Probably damaging (>=0.909); P: Possibly damaging (0.447<=pp2_hvar<=0.908); B: Benign (pp2_hvar<=0.446)
LRT (lrt)                     68069321                          D: Deleterious; N: Neutral; U: Unknown
MutationTaster (mt)           88473874                          A: disease_causing_automatic; D: disease_causing; N: polymorphism; P: polymorphism_automatic
MutationAssessor (ma)         74631375                          H: high; M: medium; L: low; N: neutral. H/M means functional and L/N means non-functional
MetaSVM (metasvm)             82098217                          D: Deleterious; T: Tolerated
MetaLR (metalr)               82098217                          D: Deleterious; T: Tolerated
GERP++ (gerp++)               89076718                          higher scores are more deleterious
PhyloP (phylop)               89553090                          higher scores are more deleterious
SiPhy (siphy)                 88269630                          higher scores are more deleterious
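
The categorical predictions are simple threshold rules on the scores, which can be checked by hand. The sketch below reproduces the SIFT and PolyPhen 2 HDIV rules from the list above. The function names are hypothetical, and the HDIV boundaries assume scores are reported to three decimals, so a value between 0.452 and 0.453 does not occur:

```shell
#!/bin/sh
# SIFT: D (deleterious) if score <= 0.05, otherwise T (tolerated).
classify_sift() {
    awk -v s="$1" 'BEGIN { if (s <= 0.05) print "D"; else print "T" }'
}

# PolyPhen 2 HDIV: D (probably damaging) if >= 0.957,
# P (possibly damaging) if >= 0.453, otherwise B (benign).
classify_pp2_hdiv() {
    awk -v s="$1" 'BEGIN {
        if (s >= 0.957) print "D"
        else if (s >= 0.453) print "P"
        else print "B"
    }'
}

classify_sift 0.03        # prints D
classify_pp2_hdiv 0.5     # prints P
```

Note the opposite directions: a low SIFT score is the damaging one, while a high PolyPhen 2 score is.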

ljb2_pp2hvar is used for the diagnosis of Mendelian diseases, which requires distinguishing mutations with drastic effects from all the remaining human variation, including abundant mildly deleterious alleles.

ljb2_pp2hdiv is used for evaluating rare alleles at loci potentially involved in complex phenotypes, dense mapping of regions identified by genome-wide association studies, and analysis of natural selection from sequence data.

References:

http://doc-openbio.readthedocs.io/projects/annovar/en/latest/user-guide/filter/?highlight=sift
