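【5.4】Introduction to IgBLAST and Its Parameter Settings
IgBLAST was developed at NCBI to facilitate the analysis of immunoglobulin (IG) and T cell receptor (TCR) variable domain sequences.
IgBLAST lets users view the matching germline V, D and J genes, the details of the rearrangement junctions, and the delineation of the IG V domain framework regions and complementarity-determining regions. It can analyze both nucleotide and protein sequences and can process sequences in batches. In addition, IgBLAST can search the germline gene databases and other sequence databases at the same time, which minimizes the chance of missing the best-matching germline V gene.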
- Online version: http://www.ncbi.nlm.nih.gov/igblast/
- Release notes: https://ncbiinsights.ncbi.nlm.nih.gov/tag/igblast/
1. Installation
1.1 Installing igblast
Download location:
https://ftp.ncbi.nih.gov/blast/executables/igblast/release/LATEST/
Choose the latest release (1.17 at the time of writing):
cd /data/software/igblast/
wget -c https://ftp.ncbi.nih.gov/blast/executables/igblast/release/LATEST/ncbi-igblast-1.17.0-x64-linux.tar.gz
tar -xzf ncbi-igblast-1.17.0-x64-linux.tar.gz
Update the environment variables:
vim /etc/profile
igblast_path=/data/software/igblast/ncbi-igblast-1.17.0/bin
export PATH=$igblast_path:$PATH
source /etc/profile
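After reloading the profile, it can be worth confirming that the binaries actually resolve on PATH before going further. The following is a minimal sketch; the helper name require_tool is hypothetical and not part of IgBLAST.

```shell
# Hypothetical helper: report whether a command resolves on PATH.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1 -> $(command -v "$1")"
    else
        echo "missing: $1 (check the PATH line in /etc/profile)" >&2
        return 1
    fi
}

# Example usage after `source /etc/profile`:
#   require_tool igblastn && require_tool igblastp && require_tool makeblastdb
#   igblastn -version
```

If any of the three tools is reported as missing, the igblast_path value in /etc/profile is the first thing to check.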
1.2 Databases
The database, internal_data and optional_file datasets
cd /data/database/igblast
#database
# wget's -r option takes no argument, so the stray "1" has been dropped
wget -r -p ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/database/
cp -fr ./ftp.ncbi.nih.gov/blast/executables/igblast/release/database ./
# internal_data
wget -r -p ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/internal_data/
cp -fr ./ftp.ncbi.nih.gov/blast/executables/igblast/release/internal_data ./
#optional_file
wget -r -p ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/optional_file/
cp -fr ./ftp.ncbi.nih.gov/blast/executables/igblast/release/optional_file ./
# Remove the mirrored directory tree
rm -fr ftp.ncbi.nih.gov
IMGT data
Download the IMGT sequences from: http://www.imgt.org/vquest/refseqh.html#VQUEST
The databases have to be built with makeblastdb:
makeblastdb -parse_seqids -dbtype nucl -in my_seq_file
For the detailed build procedure, see https://ncbi.github.io/igblast/cook/How-to-set-up.html
wget -c ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/edit_imgt_file.pl
# V-segment database
perl edit_imgt_file.pl IMGT_Mouse_IGHV.fasta > ./database/mouse_igh_v
makeblastdb -parse_seqids -dbtype nucl -in ./database/mouse_igh_v
# J-segment database
perl edit_imgt_file.pl IMGT_Mouse_IGHJ.fasta > ./database/mouse_igh_j
makeblastdb -parse_seqids -dbtype nucl -in ./database/mouse_igh_j
# D-segment database
perl edit_imgt_file.pl IMGT_Mouse_IGHD.fasta > ./database/mouse_igh_d
makeblastdb -parse_seqids -dbtype nucl -in ./database/mouse_igh_d
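IMGT FASTA headers are long, pipe-delimited strings, whereas the germline database needs short gene/allele IDs. To illustrate what the clean-up step amounts to, the sketch below keeps the second "|"-delimited field as the ID and strips "." gap characters from the sequence lines. This is only an approximation for inspecting files: the gap stripping is my assumption about the script's behavior, the function name and sample record are made up, and real builds should still use edit_imgt_file.pl.

```shell
# Hypothetical approximation of the header clean-up; use edit_imgt_file.pl for real builds.
trim_imgt_fasta() {
    awk -F'|' '
        /^>/ { print ">" $2; next }      # keep only the allele name as the ID
             { gsub(/\./, ""); print }   # drop IMGT gap dots from sequence lines
    ' "$@"
}

# Example with a made-up record:
#   printf ">X00000|IGHV1-1*01|Mus musculus|F|V-REGION|\ncag...gtc\n" | trim_imgt_fasta
```

Comparing this output with the real script's output on a few records is a quick way to spot headers that do not follow the expected layout.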
Add the environment variables:
vim /etc/profile
export BLASTDB='/data/database/igblast'
#export internal_data='/data/database/igblast/internal_data'
export IGDATA='/data/database/igblast'
source /etc/profile
If the database environment variables are not set, igblast fails with an error saying that /internal_data/ cannot be found.
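Since a wrong IGDATA only shows up once a search is started, a small pre-flight check can save a failed run. The sketch below assumes the directory layout used in this post (internal_data, optional_file and database side by side); the function name check_igdata is hypothetical.

```shell
# Hypothetical pre-flight check for the IgBLAST data directories.
# Uses the first argument if given, otherwise the IGDATA environment variable.
check_igdata() (
    dir=${1:-${IGDATA:-}}
    if [ -z "$dir" ]; then
        echo "IGDATA is not set" >&2
        exit 1
    fi
    status=0
    for sub in internal_data optional_file database; do
        if [ ! -d "$dir/$sub" ]; then
            echo "missing directory: $dir/$sub" >&2
            status=1
        fi
    done
    exit "$status"
)

# Example usage:
#   check_igdata && echo "IGDATA looks complete"
```

The function body runs in a subshell, so the exit calls end only the check and not the calling shell.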
2. Running IgBLAST
cd /data/software/igblast/ncbi-igblast-1.17.0
mkdir test; cd test
# -organism should match the species of the germline database
igblastp -germline_db_V igblast/database/mouse_gl_V -query test.fa -outfmt 3 -organism mouse
Using the IMGT germline database:
#igblastp -germline_db_V igblast/imgt_201807/IGKVLV -germline_db_J igblast/imgt_201807/IGKVLV -germline_db_D igblast/imgt_201807/IGKVLV -organism human -query test.fa -auxiliary_data igblast/optional_file/human_gl.aux -show_translation
# Run from the directory that holds database/ and optional_file/ (here /data/database/igblast); igblastn is on PATH after step 1.1
igblastn -query infile.fasta -out outfile.igblast.fmt7.out -outfmt 7 -germline_db_V ./database/mouse_gl_V -germline_db_J ./database/mouse_gl_J -germline_db_D ./database/mouse_gl_D -auxiliary_data ./optional_file/mouse_gl.aux -organism mouse -domain_system imgt -ig_seqtype Ig -show_translation -num_threads 10
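The germline databases, the auxiliary file and -organism all have to refer to the same species, and it is easy to mix them up when editing a long command by hand. One way to keep them in step is to derive all of them from a single species argument. The wrapper below is a hypothetical sketch that only prints the command (a dry run); it assumes the NCBI file naming shown above (<species>_gl_V, <species>_gl.aux) and paths without spaces.

```shell
# Hypothetical wrapper: print an igblastn command with consistent species settings.
# Paths follow the directory layout used in this post.
igblastn_cmd() (
    species=$1   # e.g. mouse or human
    query=$2     # input FASTA file
    out=$3       # output file
    db=./database/${species}_gl
    echo igblastn -query "$query" -out "$out" -outfmt 7 \
        -germline_db_V "${db}_V" -germline_db_J "${db}_J" -germline_db_D "${db}_D" \
        -auxiliary_data "./optional_file/${species}_gl.aux" \
        -organism "$species" -domain_system imgt -ig_seqtype Ig \
        -show_translation -num_threads 10
)

# Dry run (prints the command); pipe the output to `sh` to execute it:
#   igblastn_cmd mouse infile.fasta outfile.igblast.fmt7.out
```

Printing the command first makes it easy to review or log exactly what was run for each sample.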
Parameter notes
Output format
*** Formatting options
-outfmt <String>
alignment view options:
3 = Flat query-anchored, show identities,
4 = Flat query-anchored, no identities,
7 = Tabular with comment lines
19 = Rearrangement summary report (AIRR format)
Options 7 can be additionally configured to produce
a custom format specified by space delimited format specifiers.
The supported format specifiers are:
qseqid means Query Seq-id
qgi means Query GI
qacc means Query accesion
qaccver means Query accesion.version
qlen means Query sequence length
sseqid means Subject Seq-id
sallseqid means All subject Seq-id(s), separated by a ';'
sgi means Subject GI
sallgi means All subject GIs
sacc means Subject accession
saccver means Subject accession.version
sallacc means All subject accessions
slen means Subject sequence length
qstart means Start of alignment in query
qend means End of alignment in query
sstart means Start of alignment in subject
send means End of alignment in subject
qseq means Aligned part of query sequence
sseq means Aligned part of subject sequence
evalue means Expect value
bitscore means Bit score
score means Raw score
length means Alignment length
pident means Percentage of identical matches
nident means Number of identical matches
mismatch means Number of mismatches
positive means Number of positive-scoring matches
gapopen means Number of gap openings
gaps means Total number of gaps
ppos means Percentage of positive-scoring matches
frames means Query and subject frames separated by a '/'
qframe means Query frame
sframe means Subject frame
btop means Blast traceback operations (BTOP)
staxid means Subject Taxonomy ID
ssciname means Subject Scientific Name
scomname means Subject Common Name
sblastname means Subject Blast Name
sskingdom means Subject Super Kingdom
staxids means unique Subject Taxonomy ID(s), separated by a ';'
(in numerical order)
sscinames means unique Subject Scientific Name(s), separated by a ';'
scomnames means unique Subject Common Name(s), separated by a ';'
sblastnames means unique Subject Blast Name(s), separated by a ';'
(in alphabetical order)
sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
(in alphabetical order)
stitle means Subject Title
salltitles means All Subject Title(s), separated by a '<>'
sstrand means Subject Strand
qcovs means Query Coverage Per Subject
qcovhsp means Query Coverage Per HSP
qcovus means Query Coverage Per Unique Subject (blastn only)
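- The default output format is 3. For format 7, when no format specifiers are given, the default columns are 'qseqid sseqid pident length mismatch gapopen gaps qstart qend sstart send evalue bitscore'.
- The columns can be chosen explicitly like this: -outfmt "7 qseqid sseqid pident length mismatch"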
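For batch processing, -outfmt 19 is recommended: it produces a structured, tab-separated table in the AIRR rearrangement format.
For the meaning of each column, see: https://docs.airr-community.org/en/latest/datarep/rearrangements.html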
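Because the AIRR table is wide, it is often convenient to pull out a few columns by name rather than by position. The following awk-based sketch does that; the helper name airr_select is hypothetical, and the column names in the usage line (sequence_id, v_call, j_call, productive, junction_aa) are fields defined by the AIRR rearrangement schema.

```shell
# Hypothetical helper: select columns from a tab-separated AIRR file by header name.
# usage: airr_select FILE COLUMN [COLUMN ...]
airr_select() (
    file=$1; shift
    cols=$*
    awk -F'\t' -v OFS='\t' -v want="$cols" '
        NR == 1 {
            n = split(want, names, " ")
            for (i = 1; i <= NF; i++) idx[$i] = i   # header name -> column number
            for (k = 1; k <= n; k++)
                if (!(names[k] in idx)) {
                    print "no such column: " names[k] > "/dev/stderr"
                    exit 1
                }
        }
        {
            line = $(idx[names[1]])
            for (k = 2; k <= n; k++) line = line OFS $(idx[names[k]])
            print line
        }
    ' "$file"
)

# Example usage:
#   airr_select out.tsv sequence_id v_call j_call productive junction_aa
```

Selecting by name keeps the command valid even if a different IgBLAST version emits the columns in a different order.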
3. Discussion
3.1 Extending alignments to the 5' and 3' ends
Add the -extend_align3end and -extend_align5end options:
export BLASTDB='/data/database/igblast';
export IGDATA='/data/database/igblast';
/data/software/igblast/ncbi-igblast-1.17.0/bin/igblastn -query query.fa -out out.tsv -outfmt 19 -germline_db_V ./imgt_20201124/mouse_v -germline_db_J ./imgt_20201124/mouse_j -germline_db_D ./imgt_20201124/mouse_d -auxiliary_data ./optional_file/mouse_gl.aux -organism mouse -domain_system imgt -ig_seqtype Ig -show_translation -num_threads 30 -num_clonotype 200000 -extend_align3end -extend_align5end
3.2 Building a custom germline database
Download the gene sequences from the IG "V-REGION", "D-REGION", "J-REGION" and "C-GENE exon" sets of IMGT/GENE-DB.
Then:
cat IGHV_nucl.fa IGLV_nucl.fa IGKV_nucl.fa > mouse_v.fa
cat IGHJ_nucl.fa IGKJ_nucl.fa IGLJ_nucl.fa > mouse_j.fa
cat IGHD_nucl.fa >mouse_d.fa
/data/software/igblast/ncbi-igblast-1.17.0/bin/edit_imgt_file.pl mouse_v.fa > mouse_v;
/data/software/igblast/ncbi-igblast-1.17.0/bin/makeblastdb -parse_seqids -dbtype nucl -in mouse_v
/data/software/igblast/ncbi-igblast-1.17.0/bin/edit_imgt_file.pl mouse_j.fa > mouse_j;
/data/software/igblast/ncbi-igblast-1.17.0/bin/makeblastdb -parse_seqids -dbtype nucl -in mouse_j
/data/software/igblast/ncbi-igblast-1.17.0/bin/edit_imgt_file.pl mouse_d.fa > mouse_d;
/data/software/igblast/ncbi-igblast-1.17.0/bin/makeblastdb -parse_seqids -dbtype nucl -in mouse_d
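When several FASTA files are concatenated, the same sequence ID can end up in the combined file more than once, and to my knowledge makeblastdb with -parse_seqids refuses to build a database with duplicate IDs. A quick check on the edited files before running makeblastdb can catch this early. The helper below is a hypothetical sketch.

```shell
# Hypothetical helper: list sequence IDs that occur more than once in a FASTA file.
# Each duplicated ID is printed once; no output means all IDs are unique.
fasta_dup_ids() {
    awk '/^>/ { id = substr($1, 2); if (seen[id]++ == 1) print id }' "$@"
}

# Example usage on the edited files from the steps above:
#   fasta_dup_ids mouse_v
#   fasta_dup_ids mouse_j
#   fasta_dup_ids mouse_d
```

The ID is taken as the text between ">" and the first whitespace, which matches the short headers produced by the editing step.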
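References
This is my personal WeChat official account; I am rather lazy and rarely update it. You can post questions there, and if I am slow to reply, email me at tiehan@sina.cn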