【8.3】通过VCF文件制作定制化的参考基因组

最近做的项目要对reference genome基于突变进行一些modify,制作personalized genome或者说是psuedo-genome伪基因组。其实就是把某个测序样本call出来的SNP && indel替换掉参考基因组对应位置的碱基。

自己可以编写脚本修改,使用Perl中的substr来进行单个位点修改,把坐标按照从后往前的顺序,而不是从前到后。

其实有蛮多工具可以做这个的,在这里安利一下,大家可以使用体验一下。

一、GATK FastaAlternateReferenceMaker

GATK作为通用的变异检测的软件,其中有很多有用的工具,这里介绍一下FastaAlternateReferenceMaker: Create an alternative reference by combining a fasta with a vcf.

准备好检测的VCF文件,参考基因组FASTA文件,使用如下命令:

/gatk_software_path/gatk FastaAlternateReferenceMaker \
   -R /ref_genome_path/reference-genome.fasta 
   -O /ref_genome_path/psuedo-genome.fa \
   -L input.intervals \
   -V /variant_path/input.vcf \
	[--snp-mask mask.vcf]

这里的-L参数可以多单个基因或者基因组一段区域进行替换。

–snp-mask参数,当构建时psuedo-genome.fa时,该VCF文件中的SNP用作掩码(在序列中插入N)。

运行结束,个人的参考基因组就构建好了,一般制作psuedo-genome.fa就是为了消除变异带来的影响,部分其他参数可以到gatk官网查阅。另外FastaAlternateReferenceMaker使用简单的indel,当VCF文件包含复杂的位点时(complex substitutions),会忽略。 PS:

If there are multiple variants that start at a site, it chooses one of them randomly.
When there are overlapping indels (but with different start positions) only the first will be chosen.

我得案例:

gatk FastaAlternateReferenceMaker -R /opt/ref/ref.fasta -V ${output}/${ID}/${ID}.vcf -O ${output}/${ID}/${ID}.genome.fasta

二、perEditor

perEditor用于分析phased SNPs/indels,意味着VCF文件中的GT是0|1/1|1/1|2,而不是我们常见的0/1这种,明确知道变异是在母本还是父本染色体上,第一个是母本,第二个是父本,使用方式如下:

perEditor ref_genome.fa snps_indels.vcf allele indivicual personalized_genome.fa
perEditor chr1.fa variations.vcf mother 1 chr1_customized.fa

这里对参数进行一下解释:

  • allele两个选项mother|father,代表等位基因来自于哪一个亲本;
  • indivicual,整数,这个是因为有一些VCF文件中不止包含一个样本的突变信息,参数代表选择第几个样本的突变来进行创建personalized genome,1代表第一个样本;

perEditor_ra,允许用户把染色体重排信息考虑到创建个性化参考基因组中,一样只服务于phased的数据,另外还需要把涉及染色体重排的染色体放在一个目录下:

$ ls chr*fa
chr1.fa  chr2.fa  chr3.fa  chr4.fa
$ perEditor_ra rearrangements.vcf chr_length.bed centromeres.bed individual allele
$ perEditor_ra rearrangements_t.vcf chr_length_t.bed centromeres_t.bed 1 father

这里的individual和allele参数和perEditor是一样的含义; rearrangements_t.vcf,染色体重排注释结果,vcf format version 4.1; chr_length_t.bed,文件中包含基因组每一条染色体的长度,即核苷酸总数,格式如下:

chr1    200
chr2    104
chr3    154
chr4    102

centromeres_t.bed,文件中包含每条染色体的着丝粒位置的文件,着丝粒坐标用于确定新染色体的身份,当两条染色体由于重排而合并时,新染色体将按照着丝粒来源的名称来命名,继承两个着丝粒,其名称则包含两个亲本染色体的名称,文件格式如下:

chr1    50  60
chr2    15  20
chr3    20  30
chr4    80  90

bed文件和vcf文件位于同一个工作目录下,最终生成的新的染色体命名中添加了new:

$ ls *new*fa
chr1_new.fa  chr2_new.fa  chr3_new.fa  chr4_new.fa

三、g2gtools

g2gtools通过将SNP和indels整合到参考基因组中来创建自定义基因组,提取感兴趣的区域(例如外显子或转录本),并在两个基因组之间转换文件(bam,gtf,bed)的坐标。 与其他liftover 工具不同,g2gtools不会丢弃掉落在indel区域上的alignments。 版本0.2可以创建个人二倍体基因组。 新版本仍然将个人基因组坐标上的二倍体比对转换回参考基因组,因此我们可以比较种群样本之间的比对。

安装

$ conda config --add channels r
$ conda config --add channels bioconda
$ conda create -n g2gtools python=2 jupyter ipykernel
$ source activate g2gtools
$ conda install -c kbchoi g2gtools

Usage

先激活virtual environment:

source activate g2gtools

创建自定义的基因组,需要下列信息:

${REF}         reference genome in fasta format
${STRAIN}      strain name (usually a column name in the vcf file)
${VCF_INDELS}  vcf file for indels
${VCF_SNPS}    vcf file for snps
${GTF}         gene annotation file in gtf format

下面是操作流程:

# 创建映射两个基因组之间的碱基的chain文件
g2gtools vcf2chain -f ${REF} -i ${VCF_INDELS} -s ${STRAIN} -o ${STRAIN}/REF-to-${STRAIN}.chain
# 把snp添补到ref genome上
g2gtools patch -i ${REF} -s ${STRAIN} -v ${VCF_SNPS} -o ${STRAIN}/${STRAIN}.patched.fa
# 将indels整合到snp添补后的基因组中
$ g2gtools transform -i ${STRAIN}/${STRAIN}.patched.fa -c ${STRAIN}/REF-to-${STRAIN}.chain -o ${STRAIN}/${STRAIN}.fa
# 针对新的定制基因组创建定制基因注释
$ g2gtools convert -c ${STRAIN}/REF-to-${STRAIN}.chain -i ${GTF} -f gtf -o ${STRAIN}/${STRAIN}.gtf
$ g2gtools gtf2db -i ${STRAIN}/${STRAIN}.gtf -o ${STRAIN}/${STRAIN}.gtf.db

四、vcf2diploid

vcf2diploid通过将phased的变异整合到参考基因组中,从vcf文件创建二倍体基因组。

Installation

git clone https://github.com/abyzovlab/vcf2diploid.git
cd vcf2diploid
make

Running

java -Xmx10g -jar vcf2diploid.jar -id sample_id -chr file1.fa file2.fa ... [-vcf file1.vcf file2.vcf ...] > logfile.txt
OPTIONS:
id    - (required) the ID of individual whose genome is being constructed (e.g., NA12878). The tool recognizes by this ID in the VCF file 
chr   - (required) FASTA file(s) of reference sequence(s) 
vcf   - (required) VCF4.0 file(s) containing variants from parents and the individual 
Xmx   - max memory allocation for JAVA. In this example, 10GB was allocated.
logfile.txt - stores the standard output produce from the run

五、bcftools consensus

Usage

About: Create consensus sequence by applying VCF variants to a reference fasta
       file. By default, the program will apply all ALT variants. Using the
       --sample (and, optionally, --haplotype) option will apply genotype
       (or haplotype) calls from FORMAT/GT. The program ignores allelic depth
       information, such as INFO/AD or FORMAT/AD.
Usage:   bcftools consensus [OPTIONS] <file.vcf>
Options:
    -c, --chain <file>         write a chain file for liftover
    -e, --exclude <expr>       exclude sites for which the expression is true (see man page for details)
    -f, --fasta-ref <file>     reference sequence in fasta format
    -H, --haplotype <which>    choose which allele to use from the FORMAT/GT field, note
                               the codes are case-insensitive:
                                   1: first allele from GT
                                   2: second allele
                                   R: REF allele in het genotypes
                                   A: ALT allele
                                   LR,LA: longer allele and REF/ALT if equal length
                                   SR,SA: shorter allele and REF/ALT if equal length
    -i, --include <expr>       select sites for which the expression is true (see man page for details)
    -I, --iupac-codes          output variants in the form of IUPAC ambiguity codes
    -m, --mask <file>          replace regions with N
    -o, --output <file>        write output to a file [standard output]
    -s, --sample <name>        apply variants of the given sample
Examples:
   # Get the consensus for one region. The fasta header lines are then expected
   # in the form ">chr:from-to".
   samtools faidx ref.fa 8:11870-11890 | bcftools consensus in.vcf.gz > out.fa

定制基因组

$ bcftools consensus <file.vcf> \
  --fasta-ref <file> \
  --iupac-codes \
  --output <file> \
  --sample <name>
 
# Apply variants present in sample "NA001", output IUPAC codes for hets
$ bcftools consensus -i -s NA001 -f in.fa in.vcf.gz > out.fa

# Create consensus for one region. The fasta header lines are then expected
# in the form ">chr:from-to".
$ samtools faidx ref.fa 8:11870-11890 | bcftools consensus in.vcf.gz -o out.fa

六、vcfutils.pl

samtools、bcftools和vcfutils.pl三个程序联合使用,从BAM文件开始操作,到获得新的参考基因组:

$ samtools mpileup -d8000 -q 20 -Q 10 -uf  REFERENCE.fasta Your_File.bam | bcftools call -c | vcfutils.pl vcf2fq  > OUTPUT.fastq
# d, --max-depth
# -q, -min-MQ Minimum mapping quality for an alignment to be used
# -Q, --min-BQ Minimum base quality for a base to be considered
$ sed -i -e '/^+/,/^\@/{/^+/!{/^\@/!d}}; /^+/ d; s/@/>/g' OUTPUT.fastq
References

参考资料

药企,独角兽,苏州。团队长期招人,感兴趣的都可以发邮件聊聊:tiehan@sina.cn
个人公众号,比较懒,很少更新,可以在上面提问题,如果回复不及时,可发邮件给我: tiehan@sina.cn