【1.1.1】近交系小鼠的GATK流程数据准备(基因组和dbsnp)

这个流程,我没有完全走通,先记录在这里,后续再来收拾。

小鼠基因组版本:

Mus musculus (Mouse):

Ensembl	GRCm38	NCBIM37	 	 
NCBI	GRCm38	build37.2	build37.1	 
UCSC	mm10	mm9

各种类型的小鼠参考基因组: https://ftp.ensembl.org/pub/release-106/fasta/mus_musculus_c57bl6nj/dna/

一、我的解决方案

mm10 UCSC genome:

cd /data4/neoantigen/mice
wget -c http://igenomes.illumina.com.s3-website-us-east-1.amazonaws.com/Mus_musculus/UCSC/mm10/Mus_musculus_UCSC_mm10.tar.gz
tar -zxvf Mus_musculus_UCSC_mm10.tar.gz

cd Mus_musculus/UCSC/mm10/Sequence/Chromosomes/

#Edit a header in the .fa files downloaded to remove “chr”:
for f in *.fa; do sed -i 's/chr//g' $f; done;

cat chr1.fa chr2.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa chr19.fa chrM.fa chrX.fa chrY.fa > ref_mm10.fasta

bwa index -a bwtsw ref_mm10.fasta

samtools faidx ref_mm10.fasta

singularity exec /data2/neoantigen/singularity/nextNEOpi_1.3.2_18734d43.sif gatk CreateSequenceDictionary --REFERENCE ref_mm10.fasta --OUTPUT ref_mm10.dict

##dbsnp

wget ftp://ftp-mouse.sanger.ac.uk/current_snps/mgp.v5.merged.snps_all.dbSNP142.vcf.gz
gunzip mgp.v5.merged.snps_all.dbSNP142.vcf.gz

grep -v 'ˆ#' mgp.v5.merged.snps_all.dbSNP142.vcf | awk '$7 == "PASS"' | cut -f1,2,4,5 | sort -u > a

#发现新的突变
grep -v 'ˆ#' filtered-gatk.vcf | awk '$7 == "PASS"' | cut -f1,2,4,5 | sort -u > b
comm -13 a b > novel_snps

awk -F' ' '{$2 = $2 FS "."; print $1," ",$2," ",$3, " ",$4}' novel_snps > mutant.vcf

基因组下载地址: https://ftp.ensembl.org/pub/release-106/fasta/mus_musculus_c57bl6nj/dna/

二、其他

建议参考:

GATK is design for human genetics, but it also work well for inbred mice.

However, one of my colleague who studies mouse genetics, said,

  • I tried the haplotype caller from GATK. But it seems that the haplotype caller is designed for heterogeneous genome like human than for mice. Therefore, the result coming out of HC is worse than samtools, as I manually inspected a few regions that HC calls didn’t make sense.
  • In addition, in one of their mouse genomic paper that we reviewed, they even skipped the second recalibration step. We asked them why and they said it was because of the same reason: good for human but not that good for the homogeneous inbred mouse.

With my own experience with GATK4, I found that:

  • SNP: at least 97% of the time, they both have the same call for inbred mice.
  • Indels: GATK is a preferable alogrithm for calling Indels (higher accuary and lower FDR), benefits from a assemble-based caller
  • While BCFtools is as good as GATK for calling SNPs (position-based caller).

GATK4的环境可以通过singularity调用,

https://apps-01.i-med.ac.at/images/singularity/nextNEOpi_1.3.2_18734d43.sif

我都选择的NCBI下载数据

1. Genome

#or Sanger MGP (Mouse Genetics Programme) (这个地址不存在了)

wget -c ftp://ftp-mouse.sanger.ac.uk/ref/GRCm38_68.fa -O GRCm38_68.fa

需要只提取 C57BL的基因组么??

grep '>' GRCm38_68.fa  | grep "Mus musculus strain C57BL/6J chromosome" |grep -v 'genomic patch of type FI' |grep -v "localized genomic co" |sed 's/>//g' >extract_names
python ~/project/BPKit/bpkit/sequence/format_seq.py extract-seq  --inputfa  GRCm38_68.fa --resultfa C57BL.fa --seq_fp extract_names

小鼠基因组下载地址: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001635.20/

2. dbSNP

Download All in one vcf file from NCBI

wget ftp://ftp.ncbi.nih.gov/snp/organisms/archive/mouse_10090/VCF/00-All.vcf.gz -O mouse.dbsnp.vcf.gz

Download from the Sanger Mouse Genetics Programme (Sanger MGP ,我选择的这个地址)

wget -c https://ftp.ebi.ac.uk/pub/databases/mousegenomes/REL-1505-SNPs_Indels/mgp.v5.merged.snps_all.dbSNP142.vcf.gz
wget -c https://ftp.ebi.ac.uk/pub/databases/mousegenomes/REL-1505-SNPs_Indels/mgp.v5.merged.snps_all.dbSNP142.vcf.gz.tbi

3. Known Indels (这一步有报错,没有跑通,怀疑是跟下载的参考基因组有关系)

For mouse indels, the Sanger Mouse Genetics Programme (Sanger MGP) is probably the best resource. Download all MGP indels (5/2015 release,我选择的下面的地址):

wget -c https://ftp.ebi.ac.uk/pub/databases/mousegenomes/REL-1505-SNPs_Indels/mgp.v5.merged.indels.dbSNP142.normed.vcf.gz -O mgp.v5.indels.vcf.gz

wget -c https://ftp.ebi.ac.uk/pub/databases/mousegenomes/REL-1505-SNPs_Indels/mgp.v5.merged.indels.dbSNP142.normed.vcf.gz.tbi

查看vcf的head,下载的vcf文件中header中的##reference,一定要存在该reference,就是说你用到的reference 文件名字一定要是header中的这个名字。GATK对header的要求比较严格。(这里的到底需不需要把chr添加到vcf里面?? )

zcat mgp.v5.indels.vcf.gz |head -n 20
##reference=ftp://ftp-mouse.sanger.ac.uk/ref/GRCm38_68.fa

Filter for passing variants

帖子1 的方法:

#take header first
zcat mgp.v5.indels.vcf.gz | head -1000 | grep "^#" | cut -f 1-8 > mgp.v5.indels.pass.chr.vcf
#keep only passing and append
zcat mgp.v5.indels.vcf.gz | grep -v "^#" | cut -f 1-8  | grep -w "PASS"  >> mgp.v5.indels.pass.chr.vcf

#Sort VCF (automatically generated index has to be deleted due to a known bug -> No anymore):
##用的是gatk4.6
gatk CreateSequenceDictionary  -R GRCm38_68.fa -O GRCm38_68.dict
gatk SortVcf -SD GRCm38_68.dict -R GRCm38_68.fa -I mgp.v5.indels.pass.chr.vcf -O mgp.v5.indels.pass.chr.sort.vcf
#java -Xms16G -Xmx16G -jar /data/software/picard.jar SortVcf -VERBOSITY WARNING -SD GRCm38_68.dict -I mgp.v5.indels.pass.chr.vcf -O mgp.v5.indels.pass.chr.sort.vcf

帖子2 的方法 (是否将chr加入到vcf的坐标里面):

zcat mgp.v5.indels.vcf.gz | head -1000 | grep "^#" | cut -f 1-8 | grep -v "#contig" | grep -v "#source" > mgp.v5.indels.pass.chr.vcf
zcat mgp.v5.indels.vcf.gz | grep -v "^#" | cut -f 1-8 | grep -w "PASS" | sed 's/^\([0-9MXY]\)/chr\1/' >> mgp.v5.indels.pass.chr.vcf
wget -c https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/download?filename=GCF_000001635.24.zip&ncbi_phid=322C4CC2C3F757550000428F8529D847.1.m_4.022
gatk SortVcf   -I mgp.v5.indels.pass.chr.vcf   -O mgp.v5.indels.pass.chr.sort.vcf   -R GRCm38_68.fa   -SD GRCm38_68.dict

参考资料

药企,独角兽,苏州。团队长期招人,感兴趣的都可以发邮件聊聊:tiehan@sina.cn
个人公众号,比较懒,很少更新,可以在上面提问题,如果回复不及时,可发邮件给我: tiehan@sina.cn