【2.2.4.1】GATK
Genome Analysis Toolkit
- 官网:https://software.broadinstitute.org/gatk/guide/topic?name=tutorials
- GATK使用方法详解(原始数据的处理) http://www.plob.org/article/7009.html
一、安装
cd /data/software
wget -c https://github.com/broadinstitute/gatk/releases/download/4.2.6.1/gatk-4.2.6.1.zip
unzip gatk-4.2.6.1.zip
gatk --help
二、Calling
两个工具:
UnifiedGenotyper Can’t get calls for multiple records at same position in the alleles input file; UnifiedGenotyper
One way is to separate out the SNPs and indels then re-merge them. 这是UG的硬伤呀,如果生成vcf中有两个点位置一样,则只能call其中一个,插入或删除突变的位置跟点突变的位置一样,则只能call 点突变,所以他们才建议插入删除突变单独跑 http://gatkforums.broadinstitute.org/gatk/discussion/2305/cant-get-calls-for-multiple-records-at-same-position-in-the-alleles-input-file-unifiedgenotyper
和HaplotypeCaller
3.HaplotypeCaller说明
java -jar -Xmx8g /bioinfo/software/bin/GenomeAnalysisTK.jar
-l INFO
-T HaplotypeCaller
-R /bioinfo/data/iGenomes/Homo_sapiens/NCBI/build37.2/Sequence/WholeGenomeFasta/genome.fa -I NA14637-5_sort_realign_clipped.bam
-pairHMM VECTOR_LOGLESS_CACHING
-stand_emit_conf 10
-stand_call_conf 30.0
-D /bioinfo/data/SNP/SNP146_GRCh37.vcf
-L /home/qqin/genopipe_py/genopipe/data/brca_LP.target.zerobased.bed
-o /home/qqin/data_analysis/20161205_braca_lp_evaluation/1.analysis/9.call_variant/NA14637-5.g.vcf
-bamout /home/qqin/data_analysis/20161205_braca_lp_evaluation/1.analysis/9.call_variant/NA14637-5_sort_realign_recalibrate_clipped_hc.bam
--max_alternate_alleles 10
-A StrandBiasBySample
-gt_mode DISCOVERY
-S LENIENT
stand_emit_conf 位点得分低于这个值就不显示到vcf里面去
为什么UnifiedGenotyper算出来的DP值要比HaplotypeCaller的小?
HaplotypeCaller根据bam文件有个重新比对的过程, max_alternate_alleles 默认的为6
三、其他
。。。
四、讨论
1.depth大于AD
http://gatkforums.broadinstitute.org/gatk/discussion/1235/i-expect-to-see-a-variant-at-a-specific-site-but-its-not-getting-called https://software.broadinstitute.org/gatk/guide/article?id=5484
2.如何让gatk更快
实际上有3个参数加速的参数:
-nt / --num_threads
-nct / --num_cpu_threads_per_data_thread
五、报错
5.1.Input files reads and reference have incompatible contigs
原因:
This may be caused by a new check we added to validate that sequences match exactly.
It sounds like it may be overreacting in your case. One way to bypass this is to remove the md5 string
in the file headers, if you're sure that sequences are the same and haven't been modified.
解决办法:
or you can try running with -U ALLOW_SEQ_DICT_INCOMPATIBILITY
5.2 GATK jar file not found. Have you run “gatk3-register”?
conda install -c conda-forge -c bioconda gatk
GenomeAnalysisTK -h
然后报错: GATK jar file not found. Have you run “gatk3-register”?
需要下载gatk,然后注册一下
curl -L -o gatk-3.8.tar.bz2 https://anaconda.org/bioconda/gatk/3.8/download/noarch/gatk-3.8-hdfd78af_11.tar.bz2
gatk-register gatk-3.8-hdfd78af_11.tar.bz2
参考资料
个人公众号,比较懒,很少更新,可以在上面提问题,如果回复不及时,可发邮件给我: tiehan@sina.cn