【2.2】GATK

Genome Analysis Toolkit

一、安装

cd /data/software
wget -c https://github.com/broadinstitute/gatk/releases/download/4.2.6.1/gatk-4.2.6.1.zip

unzip gatk-4.2.6.1.zip

gatk --help

二、Calling

两个工具:

UnifiedGenotyper Can’t get calls for multiple records at same position in the alleles input file; UnifiedGenotyper

One way is to separate out the SNPs and indels then re-merge them. 这是UG的硬伤呀,如果生成vcf中有两个点位置一样,则只能call其中一个,插入或删除突变的位置跟点突变的位置一样,则只能call 点突变,所以他们才建议插入删除突变单独跑 http://gatkforums.broadinstitute.org/gatk/discussion/2305/cant-get-calls-for-multiple-records-at-same-position-in-the-alleles-input-file-unifiedgenotyper

和HaplotypeCaller

3.HaplotypeCaller说明

https://software.broadinstitute.org/gatk/gatkdocs/3.7-0/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php

java -jar -Xmx8g /bioinfo/software/bin/GenomeAnalysisTK.jar
-l INFO
-T HaplotypeCaller
-R /bioinfo/data/iGenomes/Homo_sapiens/NCBI/build37.2/Sequence/WholeGenomeFasta/genome.fa -I NA14637-5_sort_realign_clipped.bam
-pairHMM VECTOR_LOGLESS_CACHING
-stand_emit_conf 10
-stand_call_conf 30.0
-D /bioinfo/data/SNP/SNP146_GRCh37.vcf
-L /home/qqin/genopipe_py/genopipe/data/brca_LP.target.zerobased.bed
-o /home/qqin/data_analysis/20161205_braca_lp_evaluation/1.analysis/9.call_variant/NA14637-5.g.vcf
-bamout /home/qqin/data_analysis/20161205_braca_lp_evaluation/1.analysis/9.call_variant/NA14637-5_sort_realign_recalibrate_clipped_hc.bam
--max_alternate_alleles 10
-A StrandBiasBySample
-gt_mode DISCOVERY
-S LENIENT

stand_emit_conf 位点得分低于这个值就不显示到vcf里面去

为什么UnifiedGenotyper算出来的DP值要比HaplotypeCaller的小?

HaplotypeCaller根据bam文件有个重新比对的过程, max_alternate_alleles 默认的为6

三、其他

。。。

四、讨论

1.depth大于AD

http://gatkforums.broadinstitute.org/gatk/discussion/1235/i-expect-to-see-a-variant-at-a-specific-site-but-its-not-getting-called https://software.broadinstitute.org/gatk/guide/article?id=5484

2.如何让gatk更快

http://gatkforums.broadinstitute.org/gatk/discussion/1975/how-can-i-use-parallelism-to-make-gatk-tools-run-faster

实际上有3个参数加速的参数:

-nt / --num_threads
-nct / --num_cpu_threads_per_data_thread

五、报错:

5.1.Input files reads and reference have incompatible contigs

原因:

This may be caused by a new check we added to validate that sequences match exactly.
 It sounds like it may be overreacting in your case. One way to bypass this is to remove the md5 string 
in the file headers, if you're sure that sequences are the same and haven't been modified.

解决办法:

 or you can try running with -U ALLOW_SEQ_DICT_INCOMPATIBILITY

5.2 GATK jar file not found. Have you run “gatk3-register”?

conda install -c conda-forge -c bioconda gatk

GenomeAnalysisTK -h 

然后报错: GATK jar file not found. Have you run “gatk3-register”?

需要下载gatk,然后注册一下

curl -L -o gatk-3.8.tar.bz2 https://anaconda.org/bioconda/gatk/3.8/download/noarch/gatk-3.8-hdfd78af_11.tar.bz2
gatk-register  gatk-3.8-hdfd78af_11.tar.bz2

参考资料

药企,独角兽,苏州。团队长期招人,感兴趣的都可以发邮件聊聊:tiehan@sina.cn
个人公众号,比较懒,很少更新,可以在上面提问题,如果回复不及时,可发邮件给我: tiehan@sina.cn