【2.2】Illumina Variant Calling Pipeline

Briefly, the Illumina variant calling pipeline is as follows: After extracting FASTQ files from the SRA Normalized data, the reads are trimmed based on paired/single-end status using trimmomatic. Next they are aligned to the SARS-CoV-2 reference (NC_045512.2) using HISAT2 and variants are called using GATK. Low quality variant calls are then filtered-out, the calls are normalized, then the calls are annotated for their protein effect using snpeff, and the VCF file validated.

ASTQ extraction

fasterq-dump --split-files $acc

Read Trimming


java -jar /usr/local/trimmomatic/0.33/trimmomatic-0.33.jar PE -phred33 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36


java -jar /usr/local/trimmomatic/0.33/trimmomatic-0.33.jar SE -phred33 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36


hisat2 --no-spliced-alignment --no-unal -x $ref -q | samtools view -Sb -F256 - | samtools sort > $output.bam  
samtools index $output.bam

Variant Calling

gatk HaplotypeCaller -R $ref -I $input.bam -O $output.gvcf --minimum-mapping-quality 10 --ploidy 2 -ERC BP_RESOLUTION

Variant Filtering

gatk VariantFiltration -R $ref -V $input -O $output \  
--filter-name "lowAD10" \  
--filter-expression 'vc.getGenotype("{wildcards.acc}").getAD().1 < 10' \  
--filter-name "lowQUAL100" \  
--filter-expression 'QUAL < 100'  
--filter-name "genomeEnd" \  
--filter-expression 'POS > 29850' \  
--filter-name "highFS60" \  
--filter-expression 'FS >= 60.0' \  
--filter-name "lowQD2.0" \  
--filter-expression 'QD < 2.0' \  
--filter-name "lowReadPosRankSum4.0" \  
--filter-expression 'ReadPosRankSum < -4.0' \  
--filter-name "highSOR4.0" \  
--filter-expression 'SOR >= 4.0'  

Variant Normalization

gatk LeftAlignAndTrimVariants --split-multi-allelics \  
-R $ref \  
-V $input \  
-O $output

Variant Annotation

snpeff ann -nodownload -formatEff -classic -noStats -noLog -quiet -no-upstream -no-downstream -c $snpeff_config sars2 -v $input


个人公众号,比较懒,很少更新,可以在上面提问题,如果回复不及时,可发邮件给我: tiehan@sina.cn