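Pfam (http://pfam.sanger.ac.uk/) is a widely used database of protein families. The latest release, 26.0, contains more than 13,000 manually curated families. Pfam has two parts: Pfam-A, which is high quality and manually curated, and Pfam-B, which is annotated automatically. Pfam-B entries are generated with the ADDA algorithm and serve as a supplement to Pfam-A.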

Downloads:

The PfamScan.pl tool (ftp://ftp.sanger.ac.uk/pub/databases/Pfam/Tools)
The database (ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/). Following the README, I downloaded:

Pfam-A.hmm
Pfam-A.hmm.dat 
Pfam-B.hmm 
Pfam-B.hmm.dat 
active_site.dat
HMMER3 (http://hmmer.janelia.org/software)

Preparation:

Installing Perl and BioPerl. I already had both installed; reportedly they can be installed as follows:
 sudo apt-get install perl (replace perl with bioperl to install BioPerl)
 
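Installing Moose
 sudo -i (the system asks for your password; once you type it in, the user name in the prompt changes to root, shown in red, and you are ready to go) (I skipped this step at first because I did not have the privileges, so the installation failed, which led to the error described at the end of this post)
 Then install Moose through CPAN:
 cpan Moose (this will take a while)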

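Installing HMMER3

HMMER searches sequence databases for homologs and builds sequence alignments. It can search a whole database with a single query sequence and is very powerful.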
 tar zxf hmmer-3.1b1.tar.gz
 cd hmmer-3.1b1
 ./configure
 make
 make check
 make install
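 cd easel; make install
 Update the environment variable (this form is for bash): export PATH=/sam/hmmer/binaries:$PATH
 You can now check that the installation worked by typing this in a terminal: hmmscan -h
 That is really all there is to it: nothing much needs installing, changing the environment variable is enough.
 export PERL5LIB=/sam/hmmer/PfamScan:$PERL5LIB (the directory that contains pfam_scan.pl)
 You can check whether the variable took effect with the following command (the path to your pfam_scan.pl should be listed under @INC if it was added successfully):
 perl -V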
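
The PATH check above can also be scripted. This is only a sketch: the names `have_tool` and `report_hmmer_tools` are made up for illustration, and it assumes a bash-like shell where `command -v` is available.

```shell
# have_tool (hypothetical helper): succeed if the named program can be
# found on the current PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# report_hmmer_tools (hypothetical helper): say which of the HMMER programs
# used in this post are reachable from the current shell.
report_hmmer_tools() {
    local tool
    for tool in hmmscan hmmpress; do
        if have_tool "$tool"; then
            echo "$tool: found"
        else
            echo "$tool: NOT found (check PATH)"
        fi
    done
}

# Example:
#   report_hmmer_tools
```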
 
Why PERL5LIB rather than PATH?
 What we’re doing in a nutshell is telling PERL to push values on to the @INC array before loading any modules. You can do this on the command line, in your PERL code or with the environment variable PERL5LIB.
 PERL5LIB can contain more than one value. Just set it in your .bashrc file or wherever you see fit. This method works in bash:
 export PERL5LIB=/first/path/to/libs"${PERL5LIB:+:$PERL5LIB}"
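
As a minimal sketch of that idiom, the function below prepends a directory to PERL5LIB and adds the ":" separator only when the variable already has a value. The function name `prepend_perl5lib` is invented for this example, and /sam/hmmer/PfamScan is simply the directory used in this post.

```shell
# prepend_perl5lib (hypothetical helper): put one directory at the front of
# PERL5LIB, adding ":" only if PERL5LIB is already set and non-empty.
prepend_perl5lib() {
    export PERL5LIB="$1${PERL5LIB:+:$PERL5LIB}"
}

# Example with the directory used in this post (adjust to your own layout):
#   prepend_perl5lib /sam/hmmer/PfamScan
#   perl -e 'print join("\n", @INC), "\n"'   # the directory should be listed
```

Putting the call in .bashrc makes the setting apply to every new shell.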

Use hmmpress to build the database indexes from the downloaded files:

hmmpress Pfam-A.hmm
hmmpress Pfam-B.hmm
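
hmmpress writes four binary index files next to each HMM file, with the suffixes .h3f, .h3i, .h3m and .h3p. The sketch below checks for them so that a database is only pressed when the indexes are missing. The names `is_pressed` and `press_if_needed` are made up here, and skipping an already pressed database is my own choice rather than anything pfam_scan.pl requires.

```shell
# is_pressed (hypothetical helper): succeed only if all four index files
# that hmmpress creates for the given HMM file are present.
is_pressed() {
    local hmm="$1" suffix
    for suffix in h3f h3i h3m h3p; do
        [ -f "$hmm.$suffix" ] || return 1
    done
    return 0
}

# press_if_needed (hypothetical helper): run hmmpress only when the
# indexes are missing.
press_if_needed() {
    if is_pressed "$1"; then
        echo "$1: already pressed"
    else
        hmmpress "$1"
    fi
}

# Example:
#   press_if_needed Pfam-A.hmm
#   press_if_needed Pfam-B.hmm
```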

Usage:
 pfam_scan.pl -fasta <fasta_file> -dir <directory location of Pfam files>
 Additional options:
 -h : show this help
 -o <file> : output file, otherwise send to STDOUT
 -clan_overlap : show overlapping hits within clan member families (applies to Pfam-A families only)
 -align : show the HMM-sequence alignment for each match
 -e_seq <n> : specify hmmscan evalue sequence cutoff for Pfam-A searches (default Pfam defined)
 -e_dom <n> : specify hmmscan evalue domain cutoff for Pfam-A searches (default Pfam defined)
 -b_seq <n> : specify hmmscan bit score sequence cutoff for Pfam-A searches (default Pfam defined)
 -b_dom <n> : specify hmmscan bit score domain cutoff for Pfam-A searches (default Pfam defined)
 -pfamB : search against Pfam-B HMMs (uses E-value sequence and domain cutoff 0.001),
          in addition to searching Pfam-A HMMs
 -only_pfamB : search against Pfam-B HMMs only (uses E-value sequence and domain cutoff 0.001)
 -as : predict active site residues for Pfam-A matches
 -json [pretty] : write results in JSON format. If the optional value "pretty" is given,
          the JSON output will be formatted using the "pretty" option in the JSON
          module
 For more help, check the perldoc:
 shell% perldoc pfam_scan.pl

For example:
/sam/hmmer/PfamScan/pfam_scan.pl -fasta contig_proteins.fasta -dir /sam/hmmer/PfamScan/lib -pfamB -out contig_pfam.fasta

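In the annotation results, what is the difference between an accession followed by a number after the "." and one without?
 Reply from pfam-help@ebi.ac.uk: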
 There is no difference for the user.
 The extra numerals after the . are for internal auditing and have no meaning
 for the results. In effect both are PF00013.24 - that is: version 24 since
 first creation of family.
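
If a downstream script needs to match accessions regardless of version, the suffix can simply be dropped. A minimal sketch using shell parameter expansion; the function name `strip_version` is hypothetical.

```shell
# strip_version (hypothetical helper): remove the ".N" version suffix from a
# Pfam accession; an accession without a dot is returned unchanged.
strip_version() {
    printf '%s\n' "${1%%.*}"
}

# Example:
#   strip_version PF00013.24
```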

A first look at the results:

# <seq id> <alignment start> <alignment end> <envelope start> <envelope end> <hmm acc> <hmm name> <type> <hmm start> <hmm end> <hmm length> <bit score> <E-value> <significance> <clan>
1_1 111 424 110 425 PF01979.15 Amidohydro_1 Domain 2 332 333 185.8 1.5e-54 1 CL0034
1_2 30 130 30 130 PF13600.1 DUF4140 Family 1 104 104 52.1 6.7e-14 1 No_clan
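Here an accession starting with PF comes from Pfam-A, and one starting with PB comes from Pfam-B.
The clan column gives the next level up in the classification, that is, the clan the family belongs to.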
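
Because the output is whitespace-separated with "#" comment lines, it is easy to post-process with awk. The sketch below keeps only significant hits (column 14 equal to 1) and prints the sequence id, accession, family name, envelope coordinates and E-value. The function name `significant_hits` and the choice of columns are mine. It assumes the 15-column layout listed in the header line and that each hit sits on a single line in the output file.

```shell
# significant_hits (hypothetical helper): read pfam_scan.pl output from the
# file given as $1 and print one tab-separated line per significant hit:
#   seq id, hmm acc, hmm name, envelope start, envelope end, E-value
significant_hits() {
    awk -v OFS='\t' '
        /^#/ || NF < 15 { next }          # skip comments, blanks, short lines
        $14 == 1 { print $1, $6, $7, $4, $5, $13 }
    ' "$1"
}

# Example:
#   significant_hits contig_pfam.fasta
```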

Use the "Jump to" box on the Pfam home page to look up detailed information about an annotation. It accepts:
Pfam A accession, e.g. PF02171
Pfam A identifier, e.g. piwi
Pfam B accession, e.g. PB000001
Pfam B identifier, e.g. Pfam-B_1
UniProt sequence accession, e.g. P00789
UniProt sequence ID, e.g. CANX_CHICK
NCBI “GI” number, e.g. 113594566
NCBI secondary accession, e.g. BAF18440.1
Pfam clan accession, e.g. CL0005
metaseq ID, e.g. JCVI_ORF_1096665732460
metaseq accession, e.g. JCVI_PEP_1096665732461
Pfam clan ID, e.g. Kazal
PDB entry, e.g. 2abl
Proteome species name, e.g. Homo sapiens
The email address given earlier no longer works.
pfamlist-subscribe@sanger.ac.uk

References:

Paper: The Pfam protein families database
The README from the official site
shuixia100's blog: http://shuixia100./1/post/2012/04/how-to-install-pfam_scanpl-under-linux-ubuntu.html
Brain Goo's blog: http://www.popmartian.com/tipsntricks/2011/04/11/how-to-pass-perl-library-paths-from-the-environment/

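P.S.:

1. The Pfam team's email address is pfam-help@sanger.ac.uk; you can write to them with any questions.
2. Can't locate Bio/Pfam/Scan/PfamScan.pm in @INC (@INC contains: /etc/perl /usr/local/lib/perl/5.14.2 /usr/local/share/perl/5.14.2 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.14 /usr/share/perl/5.14 /usr/local/lib/site_perl .) at /sam/hmmer/PfamScan/pfam_scan.pl line 8.
BEGIN failed--compilation aborted at /sam/hmmer/PfamScan/pfam_scan.pl line 8.
This error cost me a lot of time. In the end I changed two things: I installed Moose through CPAN, and I set the environment variable for pfam_scan.pl. After that it worked. Given what this post says about PERL5LIB, I think the second change was the real fix, since the message says a module could not be found in @INC. Either way the problem is solved, so who cares?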

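7 thoughts on "PfamScan and the Pfam database"

  1. Over the past year, whenever I have run into something in bioinformatics that I do not understand, I have made a habit of checking your blog, and it has helped me a great deal. I have recently been working through annotation, and your earlier post summarizing COG annotation was very well organized. I have tried various cog.pl scripts, but all of them had problems. Could you share your COG script with me? Thank you very much.

    1. The Pfam database was updated in June 2015 and no longer distinguishes Pfam-A from Pfam-B. I do have a question, though. I want to use this database to find certain genes, for example mobility genes. All of my data are gene sequences, but the search has to be run against the Pfam protein database, and in the end I still want the gene sequences of the matching genes as output. How should I go about this? I am just getting started and not very proficient yet. Thank you for your guidance!

  2. Hello, and thank you very much for your blog, which has helped my studies. I have recently been looking into the Pfam database. It was updated in June 2015 and no longer distinguishes Pfam-A from Pfam-B, so I suggest updating the post. I would also like to ask a question. I work with gene sequences, but according to the literature the relevant genes have to be looked up in the Pfam protein database, and the output should ideally be the start and end positions of those genes. How should I do this? (Translate to protein first, for example with blast, and then convert the hits back to DNA sequence?)

  3. Hello! I am also using pfam_scan to analyze my FASTA sequences, but I keep running into errors when I run pfam_scan.pl, for example: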
    Command ‘translate’ not found in /home/xufeng/xufeng/software/PfamScan, /home/xufeng/xufeng/software/CPAT-1.2.2/bin, /opt/sublime_text, /home/xufeng/xufeng/bin, /home/xufeng/xufeng/software/SHOREmap_v3.2, /home/xufeng/xufeng/software/genomemapper-0.4.4, /home/xufeng/xufeng/software/shore-0.9.3, /opt, /usr/share/samtools/, /opt/cufflinks-2.2.1, /opt/stringtie, /opt/eigen, /opt/tophat-2.1.0.Linux_x86_64, /opt/SOAPdenovo2, /opt/velvet_1.2.10/contrib/shuffleSequences_fasta, /opt/bwa-0.7.12, /opt/velvet_1.2.10, /usr/local/sbin, /usr/local/bin, /usr/sbin, /usr/bin, /sbin, /bin, /usr/games, /usr/local/games at /home/xufeng/xufeng/software/PfamScan/Bio/Pfam/Scan/PfamScan.pm line 853.
    How did you solve this problem at the time?

  4. A question for the author: on CentOS 6.8, installing Moose with CPAN gives the following error. What is the cause?
    cpan[2]> install Moose
    Running install for module ‘Moose’
    ETHER/Moose-2.2004.tar.gz
    Has already been unwrapped into directory /root/.cpan/build/Moose-2.2004-4
    ETHER/Moose-2.2004.tar.gz
    ‘/usr/bin/perl Makefile.PL INSTALLDIRS=site’ returned status 6400, not re-running
