GATK

一、简介

Genome Analysis Toolkit

官网: https://software.broadinstitute.org/gatk/guide/topic?name=tutorials

GATK使用方法详解(原始数据的处理)

http://www.plob.org/article/7009.html

二、Calling

两个工具:

UnifiedGenotyper Can’t get calls for multiple records at same position in the alleles input file; UnifiedGenotyper

One way is to separate out the SNPs and indels then re-merge them. 这是UG的硬伤呀,如果生成vcf中有两个点位置一样,则只能call其中一个,插入或删除突变的位置跟点突变的位置一样,则只能call 点突变,所以他们才建议插入删除突变单独跑 http://gatkforums.broadinstitute.org/gatk/discussion/2305/cant-get-calls-for-multiple-records-at-same-position-in-the-alleles-input-file-unifiedgenotyper

和HaplotypeCaller

3.HaplotypeCaller说明

https://software.broadinstitute.org/gatk/gatkdocs/3.7-0/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php

java -jar -Xmx8g /bioinfo/software/bin/GenomeAnalysisTK.jar
-l INFO
-T HaplotypeCaller
-R /bioinfo/data/iGenomes/Homo_sapiens/NCBI/build37.2/Sequence/WholeGenomeFasta/genome.fa -I NA14637-5_sort_realign_clipped.bam
-pairHMM VECTOR_LOGLESS_CACHING
-stand_emit_conf 10
-stand_call_conf 30.0
-D /bioinfo/data/SNP/SNP146_GRCh37.vcf
-L /home/qqin/genopipe_py/genopipe/data/brca_LP.target.zerobased.bed
-o /home/qqin/data_analysis/20161205_braca_lp_evaluation/1.analysis/9.call_variant/NA14637-5.g.vcf
-bamout /home/qqin/data_analysis/20161205_braca_lp_evaluation/1.analysis/9.call_variant/NA14637-5_sort_realign_recalibrate_clipped_hc.bam
--max_alternate_alleles 10
-A StrandBiasBySample
-gt_mode DISCOVERY
-S LENIENT

stand_emit_conf 位点得分低于这个值就不显示到vcf里面去

为什么UnifiedGenotyper算出来的DP值要比HaplotypeCaller的小?

HaplotypeCaller根据bam文件有个重新比对的过程, max_alternate_alleles 默认的为6

三、其他

。。。

四、讨论

1.depth大于AD

http://gatkforums.broadinstitute.org/gatk/discussion/1235/i-expect-to-see-a-variant-at-a-specific-site-but-its-not-getting-called https://software.broadinstitute.org/gatk/guide/article?id=5484

2.如何让gatk更快

http://gatkforums.broadinstitute.org/gatk/discussion/1975/how-can-i-use-parallelism-to-make-gatk-tools-run-faster

实际上有3个参数加速的参数:

-nt / --num_threads
-nct / --num_cpu_threads_per_data_thread

五、报错:

1.Input files reads and reference have incompatible contigs

原因:

This may be caused by a new check we added to validate that sequences match exactly.
 It sounds like it may be overreacting in your case. One way to bypass this is to remove the md5 string 
in the file headers, if you're sure that sequences are the same and haven't been modified.

解决办法:

 or you can try running with -U ALLOW_SEQ_DICT_INCOMPATIBILITY

参考资料:

http://gatkforums.broadinstitute.org/gatk/discussion/6546/input-files-reads-and-reference-have-incompatible-contigs

个人公众号,比较懒,很少更新,可以在上面提问题:

更多精彩,请移步公众号阅读:

Sam avatar
About Sam
专注生物信息 专注转化医学