[4.1] ACMG variant classification standards and their automation with InterVar
I. Origins of the ACMG variant classification
- 2000: ACMG recommendations for standards for interpretation of sequence variations.
- 2008: ACMG recommendations for standards for interpretation and reporting of sequence variations: Revisions 2007.
- 2015: published jointly with the Association for Molecular Pathology (AMP) and the College of American Pathologists: Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology.
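II. Scope and standards of the ACMG variant classification
- The ACMG variant classification applies only to variants in genes that cause Mendelian disorders. It cannot be used for complex diseases, such as most non-hereditary tumors and the major diseases of old age.
- The classification results apply to germline variants, for disease-risk prediction or disease diagnosis. They are not suitable for somatic variants.
- Variants are divided into five classes: "Pathogenic", "Likely pathogenic", "Uncertain significance", "Likely benign", and "Benign".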
III. Steps of the ACMG variant classification
1. Dimensions along which evidence is graded
1.1 Population data
1) Authoritative databases: Exome Sequencing Project, 1000 Genomes Project, Exome Aggregation Consortium, etc.
2) Literature reports: meta-analysis, cohort study, case-control study, case series study (RR, OR, prevalence, frequency)
e.g. PS4, BA1
1.2 Computational and predictive data
Data sources: PolyPhen-2, ConSurf, FATHMM, SIFT, CADD, GeneSplicer, Human Splicing Finder, GERP, PhastCons, etc.
e.g. PVS1, BP4
1.3 Functional data
Data sources: functional studies
e.g. PS3, BS3
1.4 Segregation data
Data sources: sequencing data, literature
e.g. PP1, BS4
1.5 De novo data
Data sources: sequencing data, literature
e.g. PS2, PM6
1.6 Allelic data
Data sources: sequencing data, literature
e.g. PM3, BP2
1.7 Other databases
Data sources: HGMD, ClinVar, OMIM, Human Genome Variation Society, etc.
e.g. PP5, BP6
1.8 Other data
Data sources: sequencing data, literature
e.g. PP4, BP5
2. Grading of the evidence for a variant
- Evidence of pathogenicity:
Very strong: PVS1
Strong: PS1, PS2, PS3, PS4
Moderate: PM1, PM2, PM3, PM4, PM5, PM6
Supporting: PP1, PP2, PP3, PP4, PP5
- Evidence of benign impact:
Stand-alone: BA1
Strong: BS1, BS2, BS3, BS4
Supporting: BP1, BP2, BP3, BP4, BP5, BP6, BP7
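3. Description of each evidence criterion
Very strong evidence of pathogenicity (PVS):
PVS1: Null variant (nonsense, frameshift, canonical ±1 or 2 splice site, initiation codon, single-exon or multi-exon deletion), provided that loss of function of the gene is a known mechanism of disease.
Strong evidence of pathogenicity (PS):
PS1: Same amino acid change as a previously established pathogenic variant, regardless of the nucleotide change (for example, if G>C at a site is pathogenic and causes Val→Leu, then G>T at the same site, which also produces Val→Leu, is treated as pathogenic).
PS2: De novo variant detected in a patient with no family history (both parents must be sequenced to confirm maternity and paternity).
PS3: Well-established in vitro or in vivo functional studies show that the variant damages the gene or gene product.
Note: functional studies that have been validated and shown to be reproducible in a clinical diagnostic laboratory are considered sufficient evidence.
PS4: The prevalence of the variant in affected individuals is significantly higher than in controls.
Note 1: in case-control studies, the relative risk or OR is > 5.0 and the confidence interval does not include 1.0.
Note 2: when carriers are so rare that a case-control study cannot reach statistical significance, prior observation of the variant in multiple unrelated patients with the same phenotype, together with its absence in controls, may be used as moderate-level evidence.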
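The PS4 threshold in Note 1 can be checked with a small calculation. The sketch below is illustrative only: it computes an odds ratio from a 2x2 case-control table and a 95% confidence interval using the standard log-normal (Woolf) approximation, which is one common choice and not something the guideline prescribes. The function names and the example counts are hypothetical.

```python
import math


def odds_ratio_ci(case_carriers, case_noncarriers,
                  control_carriers, control_noncarriers, z=1.96):
    """Odds ratio and Woolf 95% CI for a 2x2 table (all four counts must be > 0)."""
    a, b, c, d = case_carriers, case_noncarriers, control_carriers, control_noncarriers
    odds_ratio = (a * d) / (b * c)
    # Standard error of ln(OR) under the log-normal approximation.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(math.log(odds_ratio) - z * se)
    high = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, low, high


def meets_ps4(case_carriers, case_noncarriers, control_carriers, control_noncarriers):
    """PS4 Note 1: OR > 5.0 and the confidence interval does not include 1.0."""
    odds_ratio, low, high = odds_ratio_ci(
        case_carriers, case_noncarriers, control_carriers, control_noncarriers)
    return odds_ratio > 5.0 and not (low <= 1.0 <= high)


# Hypothetical counts: 20 of 100 cases and 5 of 200 controls carry the variant.
print(odds_ratio_ci(20, 80, 5, 195))
print(meets_ps4(20, 80, 5, 195))
```

With very small carrier counts the approximation becomes unreliable, which is the situation Note 2 describes.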
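Moderate evidence of pathogenicity (PM):
PM1: Located in a mutational hot spot or a well-established functional domain (for example, the active site of an enzyme) with no benign variation.
PM2: Allele frequency of 0 in the ESP database (Exome Sequencing Project), the 1000 Genomes Project, and the ExAC database (Exome Aggregation Consortium); for recessive disorders a low frequency is allowed.
PM3: For recessive disorders, detected in trans with a pathogenic variant. Note: parents (or offspring) must be tested to confirm the phase.
PM4: Protein length changes caused by in-frame insertions or deletions in a non-repeat region, or by stop-loss variants.
PM5: Missense change at a position where a different pathogenic amino acid change has been established. For example, if Arg156His is a known pathogenic variant and Arg156Cys is now observed, the new variant has PM5 evidence.
PM6: Assumed de novo variant, but without confirmation of paternity and maternity.
Supporting evidence of pathogenicity (PP):
PP1: The variant cosegregates with disease in affected family members, and the gene is known to cause the disease.
PP2: Missense variant in a gene that has a low rate of benign missense variation and in which missense variants are a common mechanism of disease.
PP3: Multiple lines of computational evidence predict a deleterious effect (conservation, evolution, splicing impact, etc.).
PP4: The patient's phenotype or family history is highly specific for a disease with a single genetic etiology.
PP5: Reputable studies and databases report the variant as pathogenic, but the evidence is not available for an independent laboratory evaluation.
BA1: Allele frequency is > 5% in the Exome Sequencing Project, 1000 Genomes Project, or Exome Aggregation Consortium databases.
BS1: Allele frequency is greater than expected for disorder
BS2: Observed in a healthy adult individual for a recessive (homozygous), dominant (heterozygous), or X-linked (hemizygous) disorder, with full penetrance expected at an early age
BS3: Well-established in vitro or in vivo functional studies show no damaging effect on protein function.
BS4: Lack of segregation in affected members of a family.
BP1: Missense variant in a gene for which primarily truncating variants are known to cause disease
BP2: For a dominant disorder, observed in trans with a pathogenic variant; or, for any inheritance pattern, observed in cis with a pathogenic variant.
BP3: In-frame insertion or deletion in a repeat region.
BP4: Computational evidence suggests no impact on the gene or gene product.
Note: because many algorithms use the same or similar input for their predictions, each algorithm cannot be counted as an independent criterion.
BP4 can be used only once in the evaluation of any variant.
BP5: Variant found in a case with an alternate molecular basis for the disease.
BP6: Reputable studies and databases report the variant as benign, but the evidence is not available for an independent laboratory evaluation.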
BP7:A synonymous (silent) variant for which splicing prediction algorithms predict no impact to the splice consensus sequence nor the creation of a new splice site AND the nucleotide is not highly conserved
4. Classification of variants
Pathogenic
(i) 1 Very strong (PVS1) AND
(a) ≥1 Strong (PS1–PS4) OR
(b) ≥2 Moderate (PM1–PM6) OR
(c) 1 Moderate (PM1–PM6) and 1 Supporting (PP1–PP5) OR
(d) ≥2 Supporting (PP1–PP5)
(ii) ≥2 Strong (PS1–PS4) OR
(iii) 1 Strong (PS1–PS4) AND
(a) ≥3 Moderate (PM1–PM6) OR
(b) 2 Moderate (PM1–PM6) AND ≥2 Supporting (PP1–PP5) OR
(c) 1 Moderate (PM1–PM6) AND ≥4 Supporting (PP1–PP5)
Likely pathogenic
(i) 1 Very strong (PVS1) AND 1 Moderate (PM1–PM6) OR
(ii) 1 Strong (PS1–PS4) AND 1–2 Moderate (PM1–PM6) OR
(iii) 1 Strong (PS1–PS4) AND ≥2 Supporting (PP1–PP5) OR
(iv) ≥3 Moderate (PM1–PM6) OR
(v) 2 Moderate (PM1–PM6) AND ≥2 Supporting (PP1–PP5) OR
(vi) 1 Moderate (PM1–PM6) AND ≥4 Supporting (PP1–PP5)
Benign
(i) 1 Stand-alone (BA1) OR
(ii) ≥2 Strong (BS1–BS4)
Likely benign
(i) 1 Strong (BS1–BS4) and 1 Supporting (BP1–BP7) OR
(ii) ≥2 Supporting (BP1–BP7)
Uncertain significance
(i) Other criteria shown above are not met OR
(ii) the criteria for benign and pathogenic are contradictory
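The combining rules above translate almost line by line into code. The following is a minimal sketch, not InterVar's own implementation. The `classify` function and its input format (a set of met criterion codes) are assumptions made for illustration. Contradictory pathogenic and benign calls fall back to "Uncertain significance".

```python
def classify(criteria):
    """Combine met ACMG criteria (e.g. {"PVS1", "PM2"}) into one of five classes."""
    codes = set(criteria)

    def count(prefix):
        return sum(1 for code in codes if code.startswith(prefix))

    pvs, ps, pm, pp = count("PVS"), count("PS"), count("PM"), count("PP")
    ba, bs, bp = count("BA"), count("BS"), count("BP")

    pathogenic = (
        (pvs >= 1 and (ps >= 1 or pm >= 2 or (pm >= 1 and pp >= 1) or pp >= 2))
        or ps >= 2
        or (ps == 1 and (pm >= 3 or (pm == 2 and pp >= 2) or (pm == 1 and pp >= 4)))
    )
    likely_pathogenic = (
        (pvs >= 1 and pm >= 1)
        or (ps == 1 and 1 <= pm <= 2)
        or (ps == 1 and pp >= 2)
        or pm >= 3
        or (pm == 2 and pp >= 2)
        or (pm == 1 and pp >= 4)
    )
    benign = ba >= 1 or bs >= 2
    likely_benign = (bs >= 1 and bp >= 1) or bp >= 2

    # Pathogenic is tested before likely pathogenic, benign before likely benign.
    p_call = "Pathogenic" if pathogenic else ("Likely pathogenic" if likely_pathogenic else None)
    b_call = "Benign" if benign else ("Likely benign" if likely_benign else None)

    if p_call and b_call:
        return "Uncertain significance"  # contradictory evidence
    return p_call or b_call or "Uncertain significance"


print(classify({"PVS1", "PM2"}))
print(classify({"BS1", "BP4"}))
```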
IV. How InterVar helps
The ACMG classification standard has 28 criteria in total. Of these, 18 can be annotated automatically by computer, and the other 10 need manual adjustment.
Annotation:
perl table_annovar.pl input.vcf humandb/ -buildver hg19 -remove -out output \
  -protocol refGene,esp6500siv2_all,1000g2015aug_all,avsnp144,dbnsfp30a,clinvar_20160302,exac03,dbscsnv11,dbnsfp31a_interpro,rmsk,ensGene,knownGene \
  -operation g,f,f,f,f,f,f,f,f,r,g,g -nastring . -vcfinput
4.1 Databases
Required databases: refGene esp6500siv2_all 1000g2015aug avsnp147 dbnsfp33a clinvar_20161128 exac03 dbscsnv11 dbnsfp31a_interpro rmsk ensGene knownGene
rmsk is the exception and is downloaded like this:
perl annotate_variation.pl -buildver hg19 -downdb rmsk humandb/
All the others can be downloaded like this:
perl annotate_variation.pl -downdb -buildver hg19 -webfrom annovar refGene humandb/
ANNOVAR does the annotation, using the databases avsnp147, refGene, esp6500siv2_all, 1000g2015aug_all, exac03, clinvar_20161128, dbnsfp33a, dbscsnv11, dbnsfp31a_interpro, rmsk, ensGene, and knownGene. Download page: http://annovar.openbioinformatics.org/en/latest/user-guide/download/
avsnp147: assigns dbSNP identifiers
refGene: gene names (RefSeq)
esp6500siv2_all: allele frequency (NHLBI Exome Sequencing Project, ESP6500)
1000g2015aug_all: alternative allele frequency (AAF) in the 1000 Genomes Project (version August 2015)
exac03: AAF in the Exome Aggregation Consortium (ExAC) Browser (version 0.3)
clinvar_20160302: pathogenic-variant database, as reported in ClinVar (version 20160302)
dbnsfp33a: functional deleteriousness prediction scores from dbNSFP (whole-exome SIFT, PolyPhen2 HDIV, PolyPhen2 HVAR, LRT, MutationTaster, MutationAssessor, FATHMM, PROVEAN, MetaSVM, MetaLR, VEST, M-CAP, CADD, GERP++, DANN, fathmm-MKL, Eigen, GenoCanyon, fitCons, PhyloP and SiPhy scores from dbNSFP version 3.3a)
dbnsfp31a_interpro: protein-domain annotation from dbNSFP and InterPro (which integrates information about protein families, domains, and functional sites)
dbscsnv11: splice-site prediction (splicing impact by AdaBoost and Random Forest)
rmsk: repeat regions (UCSC Genome Browser)
ensGene: gene annotation (Ensembl)
knownGene: gene annotation (UCSC Known Genes)
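Because `-protocol` and `-operation` must line up one to one, it can help to keep each database next to its operation code and generate the command from that table. The sketch below does this in Python. The helper name `build_table_annovar_command` is hypothetical, the flags are only those that appear in the command shown earlier, and the database versions follow the "required databases" list (g = gene-based, f = filter-based, r = region-based).

```python
# Each protocol is paired with its operation code so the two lists cannot drift apart.
PROTOCOLS = [
    ("refGene", "g"),
    ("esp6500siv2_all", "f"),
    ("1000g2015aug_all", "f"),
    ("avsnp147", "f"),
    ("dbnsfp33a", "f"),
    ("clinvar_20161128", "f"),
    ("exac03", "f"),
    ("dbscsnv11", "f"),
    ("dbnsfp31a_interpro", "f"),
    ("rmsk", "r"),
    ("ensGene", "g"),
    ("knownGene", "g"),
]


def build_table_annovar_command(vcf, db_dir, out_prefix, build="hg19"):
    """Return the table_annovar.pl call as an argument list (nothing is executed)."""
    names = ",".join(name for name, _ in PROTOCOLS)
    operations = ",".join(op for _, op in PROTOCOLS)
    return [
        "perl", "table_annovar.pl", vcf, db_dir,
        "-buildver", build, "-remove", "-out", out_prefix,
        "-protocol", names, "-operation", operations,
        "-nastring", ".", "-vcfinput",
    ]


print(" ".join(build_table_annovar_command("input.vcf", "humandb/", "output")))
```

The resulting list can be passed to `subprocess.run` once ANNOVAR and the databases are installed.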
4.2 InterVar annotation
As noted above, ACMG has 28 criteria in total. Of these, 18 (PVS1, PS1, PS4, PM1, PM2, PM4, PM5, PP2, PP3, PP5, BA1, BS1, BS2, BP1, BP3, BP4, BP6, and BP7) can be annotated automatically, and the remaining 10 (PS2, PS3, PM3, PM6, PP1, PP4, BS3, BS4, BP2, BP5) must be supplied manually.
1. PVS1 (automated)
Null variants (nonsense variants, frameshift indels, canonical splice variants) readily cause loss of function (LOF). In the ANNOVAR annotation, LOF variants are frameshift indel, stop-gain, stop-loss, and splicing variants in canonical transcripts.
Analysis of the genes that harbor pathogenic null variants in ClinVar, combined with the ExAC Browser gene set, yielded 4,807 genes in total (gene names as annotated by RefGene).
We first filtered ClinVar (version 20160302) by taking those variants shown in MedGen and then removing common variants (allele frequencies > 5%) and variants with conflicting annotations. The variants in ClinVar were annotated by ANNOVAR with RefGene definitions, and we identified 1,988 genes harboring at least one LOF variant that is "pathogenic" in ClinVar. Recently, the ExAC analyzed high-quality exome (protein-coding region) DNA sequence data for 60,706 individuals and identified 3,230 genes as LOF intolerant. We combined these two gene sets from ClinVar and the ExAC Browser and generated 4,807 genes as our final LOF-intolerant gene list. Null variants in the canonical transcripts for these 4,807 genes were assigned a PVS1 of 1. However, on the basis of the canonical rules for nonsense-mediated mRNA decay, we did not consider nonsense variants that are downstream of or within 50 nucleotides of the final exon-junction complex.
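PVS_t1: 'Func.refGene' and 'ExonicFunc.refGene' contain "nonsense", "frameshift", "splic", or "stopgain", and do not contain "nonframe"
PVS_t2: the gene is in the list that InterVar compiled from ClinVar and the ExAC Browser
PVS_t3: dbscSNV_RF_SCORE > 0.6 or dbscSNV_ADA_SCORE > 0.6
PVS_t4: the exon position is taken from AAChange.knownGene (InterVar ships a knowngenecanonical database), and the variant is in the first or last exon of the gene or within the last 50 bp at the 3' end of the gene
PVS1 is assigned when:
PVS_t1 and PVS_t2 are both met
PVS_t3 is not met
PVS_t4 is not met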
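The four flags can be combined in a few lines. This is a sketch of one reading of the listing above, not InterVar's source code. It assumes that a null-variant keyword (PVS_t1) and membership in the LOF-intolerant gene list (PVS_t2) are both required, as in the quoted rule that null variants in the 4,807 listed genes receive PVS1, and that either PVS_t3 or PVS_t4 vetoes the call. The function name, the `near_transcript_end` parameter, and the gene names are hypothetical.

```python
NULL_KEYWORDS = ("nonsense", "frameshift", "splic", "stopgain")


def check_pvs1(func, exonic_func, gene, lof_genes,
               rf_score=None, ada_score=None,
               near_transcript_end=False, cutoff=0.6):
    """Return 1 if PVS1 applies under the assumptions stated above, else 0."""
    text = f"{func} {exonic_func}".lower()
    # PVS_t1: a null-variant keyword is present and the variant is not in-frame.
    t1 = any(keyword in text for keyword in NULL_KEYWORDS) and "nonframe" not in text
    # PVS_t2: the gene is in the LOF-intolerant list.
    t2 = gene in lof_genes
    # PVS_t3: a dbscSNV score above the cutoff (missing scores are ignored).
    t3 = any(score is not None and score > cutoff for score in (rf_score, ada_score))
    # PVS_t4: first/last exon or last 50 bp, computed elsewhere and passed in.
    t4 = near_transcript_end
    return int(t1 and t2 and not t3 and not t4)


lof_genes = {"GENE_A"}  # hypothetical gene list
print(check_pvs1("exonic", "stopgain", "GENE_A", lof_genes))
print(check_pvs1("exonic", "nonframeshift deletion", "GENE_A", lof_genes))
```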
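2. PS1 and PM5 (automated)
Consider a missense site where a change from the normal amino acid A to amino acid B is known to be pathogenic. If a different nucleotide change at this site still produces amino acid B, PS1 is assigned. If it produces an amino acid other than B (and, of course, other than A), PM5 is assigned.
We first filtered ClinVar (subject to the same data-cleaning procedure described above), picked out all missense variants annotated as pathogenic, and stored the amino acid changes in an InterVar-specific database. We also inferred the splicing impact of these exonic missense variants by ANNOVAR from the "dbscsnv11" database to assess the possibility that they act through splicing disruption rather than amino acid changes.
PS1 implementation:
Func.refGene and 'ExonicFunc.refGene' contain ["missense", "nonsynony"], the resulting amino acid is a known pathogenic amino acid, and the condition (dbscSNV_RF_SCORE > 0.6 or dbscSNV_ADA_SCORE > 0.6) is not met. (InterVar's check here is faulty; a bug report has been submitted.)
PM5 implementation:
Func.refGene and 'ExonicFunc.refGene' contain ["missense", "nonsynony"], the site is in InterVar's AA_change database, and the change is not one of the changes recorded there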
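A compact way to see how PS1 and PM5 split is to key the known pathogenic changes by gene and protein position. The sketch below assumes a hypothetical lookup table `known_pathogenic` mapping `(gene, position)` to the set of pathogenic alternate residues. InterVar's actual AA_change database has its own format, which is not reproduced here.

```python
def check_ps1_pm5(exonic_func, gene, position, alt_aa, known_pathogenic,
                  rf_score=None, ada_score=None, cutoff=0.6):
    """Return (PS1, PM5) for a missense variant under the rules described above."""
    text = exonic_func.lower()
    if not any(keyword in text for keyword in ("missense", "nonsynony")):
        return 0, 0
    known = known_pathogenic.get((gene, position), set())
    if not known:
        return 0, 0  # no pathogenic missense change recorded at this residue
    # A predicted splicing effect blocks PS1, since the known variant may act via splicing.
    splice_hit = any(score is not None and score > cutoff
                     for score in (rf_score, ada_score))
    ps1 = int(alt_aa in known and not splice_hit)
    pm5 = int(alt_aa not in known)
    return ps1, pm5


# Hypothetical entry mirroring the Arg156His example from the PM5 definition.
known = {("GENE_A", 156): {"His"}}
print(check_ps1_pm5("nonsynonymous SNV", "GENE_A", 156, "His", known))
print(check_ps1_pm5("nonsynonymous SNV", "GENE_A", 156, "Cys", known))
```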
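3. PS2 and PM6 (manual)
PS2: De novo variant detected in a patient with no family history (both parents must be sequenced for confirmation).
PM6: Assumed de novo variant, but without confirmation of paternity and maternity.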
The de novo status of the variants gives strong support for the pathogenic status for PS2 if both maternity and paternity can be confirmed; if maternity or paternity is not confirmed, then moderate evidence of pathogenicity should be applied to PM6.
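4. PS3 and BS3 (manual)
If in vitro or in vivo functional studies show that the variant damages the gene or gene product, PS3 is assigned; if they show no damaging effect, BS3 is assigned. Hopefully ANNOVAR will add a database for this in the future.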
5. BA1, BS1, BS2, PS4, and PM2 (automated)
These criteria are assigned from allele frequencies: a high-frequency variant is unlikely to cause a rare disease, because otherwise the association would be weak.
We retrieved information on disease prevalence from OrphaNet and translated OrphaNet identifiers into OMIM identifiers.
BA1: the allele frequency is > 5% in at least one of the NHLBI Exome Sequencing Project (ESP6500), 1000 Genomes Project, and ExAC Browser databases
BA1 implementation:
at least one of 1000g2015aug_all, esp6500siv2_all, ExAC_ALL is greater than 5%
BS1: Allele frequency (ExAC Browser) is greater than expected for disorder (InterVar's default threshold is 1%)
BS1 implementation:
at least one of 1000g2015aug_all, esp6500siv2_all, ExAC_ALL is greater than 0.5%
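Both frequency checks reduce to comparing the largest annotated frequency against a cutoff. The sketch below assumes the frequencies arrive as a dict keyed by ANNOVAR column name, with "." for missing values as produced by `-nastring .`. The function names are hypothetical. The BS1 cutoff defaults to the 0.5% used in the implementation note above and is left as a parameter so it can be tuned to the disorder.

```python
FREQUENCY_COLUMNS = ("1000g2015aug_all", "esp6500siv2_all", "ExAC_ALL")


def max_frequency(annotation):
    """Largest allele frequency across the three population columns, or None."""
    values = []
    for column in FREQUENCY_COLUMNS:
        raw = annotation.get(column)
        if raw in (None, "", "."):
            continue  # variant not reported in this database
        values.append(float(raw))
    return max(values) if values else None


def check_ba1_bs1(annotation, ba1_cutoff=0.05, bs1_cutoff=0.005):
    """Return (BA1, BS1): each is 1 if any database exceeds the respective cutoff."""
    top = max_frequency(annotation)
    if top is None:
        return 0, 0
    return int(top > ba1_cutoff), int(top > bs1_cutoff)


row = {"1000g2015aug_all": "0.08", "esp6500siv2_all": ".", "ExAC_ALL": "0.02"}
print(check_ba1_bs1(row))
```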
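BS2: If a variant is observed in a healthy adult in the 1000 Genomes Project as a homozygote (for diseases defined as recessive in OMIM) or as a heterozygote otherwise. (ExAC Browser and ESP6500 are not used here because the variants in those databases may be associated with various diseases.)
BS2 implementation:
Use Gene.ensGene to find name A in mim2gene, then look A up in mim_adultonset: if it is found, BS2 is 0. Look A up in mim_recessive: if it is found, BS2 is 1. If A is found in mim_domin_dict, BS2 is 0, but if the variant is also found in BS2_snps_domin, BS2 can be set back to 1.
PM2: In the ESP database (Exome Sequencing Project), the 1000 Genomes Project, and the ExAC database (Exome Aggregation Consortium), the allele frequency must be 0 in all of them for dominant inheritance; for recessive inheritance a low frequency (< 0.5%) is allowed.
If a variant that is responsible for dominant diseases is absent in all control subjects from ESP6500, 1000 Genomes Project, and the ExAC Browser, PM2 will be applied. If the variant causes recessive diseases and has a very low frequency with AAF < 0.5%, then PM2 can also be applied.
InterVar has compiled which genes correspond to recessive diseases and which to dominant diseases.
PM2 implementation:
the variant has no frequency in any database, or the gene is in the mim_recessive list (indexed by Gene and Gene.ensGene) and its frequency does not exceed 0.1% in any database
PS4: The prevalence of the variant in affected individuals is significantly higher than in controls.
Note 1: in case-control studies, the relative risk or OR is > 5.0 and the confidence interval does not include 1.0.
Note 2: when carriers are so rare that a case-control study cannot reach statistical significance, prior observation of the variant in multiple unrelated patients with the same phenotype, together with its absence in controls, may be used as moderate-level evidence.
(This value can also be adjusted manually; InterVar has compiled a database for this as well.)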
6. PM1 (automated)
PM1: Located in a mutational hot spot or a well-established functional domain with no benign variation, such as the active site of an enzyme. The functional-domain annotation can be obtained from dbnsfp31a_interpro.
We first annotated all ClinVar variants (subject to the same data-cleaning procedure described above) with protein-domain information and then compiled a list in which domains contained only pathogenic or likely pathogenic variants without benign or common (allele frequency > 5%) variants. This list is provided within the InterVar package and will be updated regularly.
Implementation:
PM1:
Func.refGene and 'ExonicFunc.refGene' contain ["missense", "nonsynony"], and the annotated Interpro_domain is not in domains_with_benigns
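The PM1 rule is a set-membership test once the domain annotation is available. The sketch below is a hedged illustration. `check_pm1` is a hypothetical name, `domains_with_benigns` stands for the set of domains known to contain benign variants, and treating a missing domain annotation (".") as "no PM1" is an assumption of this sketch.

```python
def check_pm1(exonic_func, interpro_domain, domains_with_benigns):
    """Return 1 if a missense variant lies in a domain with no known benign variants."""
    text = exonic_func.lower()
    if not any(keyword in text for keyword in ("missense", "nonsynony")):
        return 0
    if interpro_domain in (None, "", "."):
        return 0  # assumption: without a domain annotation PM1 is not assigned
    return int(interpro_domain not in domains_with_benigns)


benign_domains = {"Domain_X"}  # hypothetical domain names
print(check_pm1("nonsynonymous SNV", "Domain_Y", benign_domains))
print(check_pm1("nonsynonymous SNV", "Domain_X", benign_domains))
```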
7. PM3 and BP2 (manual)
The pathogenicity of a variant also needs to be evaluated on the basis of whether variants with known pathogenicity exist in cis or trans with it. InterVar does not know the cis/trans status for variants, so this needs to be provided by users in the second step (manual adjustment) of InterVar. For two heterozygous variants that are present in a gene associated with recessive disorders, if one is pathogenic and the other is located in trans, then moderate evidence of PM3 will be applied. If more than two variants are observed in trans, then moderate evidence for pathogenicity can be upgraded to strong. If the variants are present in a gene associated with dominant diseases, yet one variant is pathogenic and the other is located in trans, then supporting evidence of benign status will be applied to BP2 for the other variant. Regardless of models of disease inheritance, for two variants, if one is pathogenic and the other is observed
8. PM4 and BP3 (automated)
These use the rmsk database from the UCSC Genome Browser (which stopped being updated long ago). It is generated by RepeatMasker.
PM4: non-frameshift insertion or non-frameshift deletion in a non-repeat region, or stop-loss variants
BP3: non-frameshift insertion or non-frameshift deletion in a repeat region
PM4 implementation:
Func.refGene and ExonicFunc.refGene contain one of ["nonframeshift insertion", "nonframeshift deletion", "stoploss"] and the variant is not in a repeat region (per the rmsk annotation); if it is in a repeat region, the annotation must contain stoploss
BP3 implementation:
Func.refGene and ExonicFunc.refGene contain one of ["nonframeshift insertion", "nonframeshift deletion", "stoploss"], rmsk has an annotation, and Interpro_domain has none
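PM4 and BP3 are two sides of the same test, so they can be computed together. In this sketch, `check_pm4_bp3` is a hypothetical name and `in_repeat` is assumed to be derived from whether the rmsk column has a value. BP3 is restricted to in-frame indels, following the BP3 definition instead of the broader keyword list in the implementation note.

```python
def check_pm4_bp3(exonic_func, in_repeat, interpro_domain=None):
    """Return (PM4, BP3) from the exonic function, repeat status, and domain annotation."""
    text = exonic_func.lower()
    inframe_indel = ("nonframeshift insertion" in text
                     or "nonframeshift deletion" in text)
    stoploss = "stoploss" in text
    # PM4: stop-loss anywhere, or an in-frame indel outside repeat regions.
    pm4 = int(stoploss or (inframe_indel and not in_repeat))
    # BP3: in-frame indel inside a repeat with no annotated protein domain.
    has_domain = interpro_domain not in (None, "", ".")
    bp3 = int(inframe_indel and in_repeat and not has_domain)
    return pm4, bp3


print(check_pm4_bp3("nonframeshift deletion", in_repeat=False))
print(check_pm4_bp3("nonframeshift deletion", in_repeat=True, interpro_domain="."))
```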
9. PP1 and BS4 (manual)
Familial segregation of a variant with a disease is an important sign for linking the variant to the disease. If segregation is found in multiple affected family members and if this gene is definitively known to be associated with this disease, then PP1 will be applied. When there is a lack of segregation in affected members of a family, then the benign supporting evidence of BS4 will be applied. Because InterVar does not know the information on segregation, this piece of evidence can be provided
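10. PP2 and BP1 (automated)
A missense variant gets PP2 when missense changes are the predominant pathogenic variant type in its gene, and BP1 when the gene's predominant pathogenic variant type is something else (truncating). InterVar has compiled the gene lists that correspond to BP1 and PP2.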
For many genes, the spectrum or distribution of pathogenic and benign variants can be informative for the pathogenicity status. For a given gene, if the missense variants are common causes of the disorder and the gene also has very few benign variants, then a missense variant in this gene can be supporting evidence for pathogenicity, and PP2 will be applied. However, if the truncating variants are major causes of the disease, then a missense variant in this gene can be supporting evidence for a benign status, and BP1 will be applied.
We annotated all variants in ClinVar (subject to the same data-cleaning procedure described above). For a given gene, if most of the pathogenic variants (>80% and at least one variant) are missense, and if a small proportion (<10% and less than one variant) of missense variants are benign, then for missense variants, PP2 will be assigned as 1. The treatment for BP1 is similar to that for PP2, but we assess whether most of pathogenic variants (>80% and at least one variant) are truncating variants. The truncating variants are defined as stop-gain, stop-loss, frameshift indel, or those disrupting splice sites. If the user's variants are missense in this gene, BP1 will be assigned as 1.
PP2 implementation: Func.refGene and 'ExonicFunc.refGene' contain ["missense", "nonsynony"], and the gene is in PP2_genes_dict
BP1 implementation: Func.refGene and 'ExonicFunc.refGene' contain ["missense", "nonsynony"], and the gene is in BP1_genes_dict
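How a gene ends up in the PP2 or BP1 list can be illustrated from per-gene ClinVar counts. The sketch below applies the >80% and <10% proportions quoted above. The function name, its parameters, and the example counts are hypothetical, and the "at least one variant" conditions are implemented as simple count checks, which is an interpretation and not InterVar's code.

```python
def gene_list_membership(n_path_missense, n_path_truncating, n_path_total,
                         n_benign_missense, n_missense_total):
    """Return (is_pp2_gene, is_bp1_gene) from per-gene variant counts."""
    if n_path_total == 0:
        return False, False  # no pathogenic variants, nothing to infer
    missense_share = n_path_missense / n_path_total
    truncating_share = n_path_truncating / n_path_total
    benign_share = (n_benign_missense / n_missense_total) if n_missense_total else 0.0
    # PP2 gene: pathogenic variants are mostly missense and few missense are benign.
    pp2_gene = n_path_missense >= 1 and missense_share > 0.8 and benign_share < 0.1
    # BP1 gene: pathogenic variants are mostly truncating.
    bp1_gene = n_path_truncating >= 1 and truncating_share > 0.8
    return pp2_gene, bp1_gene


# Hypothetical gene: 9 of 10 pathogenic variants are missense, 1 of 20 missense is benign.
print(gene_list_membership(9, 1, 10, 1, 20))
```

A missense variant then receives PP2 or BP1 according to which list its gene belongs to.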
11. PP3 and BP4 (automated)
PP3: the prediction tools report the variant as deleterious.
BP4: the prediction tools do not report the variant as deleterious.
All sets of in silico results must agree when PP3 or BP4 is assigned. The scores come from the dbnsfp30a database: the MetaSVM score is used for deleteriousness prediction and GERP++ for evolutionary conservation. Thresholds:
- 0.0 for MetaSVM scores (greater scores indicate more likely deleterious effects)
- 2.0 for GERP++_RS (smaller scores indicate less conservation),
- 0.6 for adaptive boosting (ADA) and random forest (RF) scores of dbscSNV as splicing impact (larger scores indicate more likely splice altering)
M-CAP and REVEL are not used, and neither is PolyPhen.
PP3 automated annotation:
dbscSNV_RF_SCORE > 0.6 or dbscSNV_ADA_SCORE > 0.6, and GERP++_RS > 2, and the Func.refGene and ExonicFunc.refGene annotations do not contain ("synon", "coding-synon"), or MetaSVM_score > 0. (But "nonsynon" itself contains the string "synon". Is this check really correct?)
BP4 automated annotation:
dbscSNV_RF_SCORE <= 0.6 or dbscSNV_ADA_SCORE <= 0.6, and GERP++_RS <= 2, and the Func.refGene and ExonicFunc.refGene annotations contain ("synon", "coding-synon") but not "nonsynon", or MetaSVM_score <= 0
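The requirement that all in silico results agree can be expressed as a vote over whichever scores are available. This sketch is a simplified reading based only on the three thresholds listed above and deliberately leaves out the "synon" string checks. It is not a reproduction of InterVar's logic, and the function names are hypothetical. Missing values ("." or None) are skipped.

```python
def _to_float(value):
    """Convert an annotation value to float, returning None for missing entries."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def check_pp3_bp4(metasvm=None, gerp_rs=None, rf_score=None, ada_score=None):
    """Return (PP3, BP4): PP3 if every available predictor is deleterious, BP4 if none is."""
    votes = []
    score = _to_float(metasvm)
    if score is not None:
        votes.append(score > 0.0)          # MetaSVM: above 0 is deleterious
    score = _to_float(gerp_rs)
    if score is not None:
        votes.append(score > 2.0)          # GERP++_RS: above 2 is conserved
    splice = [s for s in (_to_float(rf_score), _to_float(ada_score)) if s is not None]
    if splice:
        votes.append(max(splice) > 0.6)    # dbscSNV: above 0.6 is splice-altering
    if not votes:
        return 0, 0                        # nothing to base a call on
    return int(all(votes)), int(not any(votes))


print(check_pp3_bp4(metasvm="0.5", gerp_rs="4.2"))
print(check_pp3_bp4(metasvm="-0.8", gerp_rs="1.0", rf_score="0.1", ada_score="0.2"))
```

Because a single disagreement returns (0, 0), this version never assigns both criteria to the same variant.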
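12. PP4 (manual)
PP4: The patient's phenotype or family history is highly specific for a disease with a single genetic etiology.
13. PP5 and BP6 (automated)
PP5: Reputable studies and databases report the variant as pathogenic, but the evidence is not available for an independent laboratory evaluation.
BP6: Reputable studies and databases report the variant as benign, but the evidence is not available for an independent laboratory evaluation.
InterVar uses the ClinVar database by default; users can supply their own database (for example, HGMD).
PP5 automated annotation: from the ClinVar annotation, the variant is called pathogenic if the text contains "ikely pathogenic" or "athogenic"
BP6 automated annotation: from the ClinVar annotation, the variant is called benign if the text contains "ikely benign" or "enign"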
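Dropping the first letter of each keyword makes the match independent of capitalization ("Pathogenic" and "pathogenic" both contain "athogenic"). The sketch below applies that substring rule. The function name is hypothetical. The guard against "Conflicting interpretations of pathogenicity" is an addition of this sketch, because that label also contains "athogenic", and it is not described in the source.

```python
def check_pp5_bp6(clinvar_significance):
    """Return (PP5, BP6) from a ClinVar clinical-significance string."""
    text = (clinvar_significance or "").replace("_", " ")
    if "onflicting" in text:
        return 0, 0  # assumption: conflicting entries support neither criterion
    pp5 = int("ikely pathogenic" in text or "athogenic" in text)
    bp6 = int("ikely benign" in text or "enign" in text)
    return pp5, bp6


for label in ("Pathogenic", "Likely benign", "Uncertain significance"):
    print(label, check_pp5_bp6(label))
```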
14. BP5 (manual)
If a disease has an alternate molecular basis (caused by more than one gene) and if a variant is observed in a gene related to the disease, then it will be supporting evidence for a benign status, and BP5 will be assigned as 1. Note that this criterion is stronger for a gene associated with a dominant disorder than for a gene associated with a recessive disorder. Because of the multiple exceptions for this criterion, as described before,25 users can adjust this criterion
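15. BP7 (automated)
BP7: a synonymous variant that neither affects the splicing region nor lies in a conserved region
dbscSNV database: dbscSNV_RF_SCORE and dbscSNV_ADA_SCORE should both be below 0.6; a GERP++_RS above 2 indicates a highly conserved region
BP7 implementation:
The Func.refGene and ExonicFunc.refGene annotations contain ("synon", "coding-synon") but not "nonsynon", dbscSNV_RF_SCORE and dbscSNV_ADA_SCORE are both < 0.6, and GERP++_RS < 2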
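The BP7 test also shows how to handle the "synon" versus "nonsynon" substring problem: check for "nonsynon" explicitly before accepting "synon". This is a minimal sketch with a hypothetical function name. Treating missing splice or conservation scores as "no impact" and "not conserved" is an assumption made here for illustration.

```python
def check_bp7(exonic_func, gerp_rs=None, rf_score=None, ada_score=None):
    """Return 1 for a synonymous variant with no predicted splice impact and low conservation."""
    text = exonic_func.lower()
    # "nonsynonymous" contains "synon", so it has to be excluded explicitly.
    synonymous = "synon" in text and "nonsynon" not in text
    if not synonymous:
        return 0
    splice_impact = any(score is not None and score >= 0.6
                        for score in (rf_score, ada_score))
    conserved = gerp_rs is not None and gerp_rs >= 2.0
    return int(not splice_impact and not conserved)


print(check_bp7("synonymous SNV", gerp_rs=1.2, rf_score=0.1, ada_score=0.2))
print(check_bp7("nonsynonymous SNV", gerp_rs=1.2, rf_score=0.1, ada_score=0.2))
```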
References:
InterVar: Clinical Interpretation of Genetic Variants by the 2015 ACMG-AMP Guidelines
Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology