【2.1】Protein Family and Domain Databases

I. Protein Motif and Domain Databases

Motifs and domains
PROSITE database
PRINTS database
BLOCKS database
ProDom database
Pfam database
SMART database
InterPro database
Conserved Domain Database
CDART

Motifs and domains:

Biologists can gain insight into protein function by identifying short consensus sequences associated with known functions. These consensus sequence patterns are termed motifs and domains. A motif is a short conserved sequence pattern associated with a distinct function of a protein or DNA. It often corresponds to a distinct structural site that performs a particular function. A typical motif, such as a Zn-finger motif, is ten to twenty amino acids long.

A domain is also a conserved sequence pattern, defined as an independent functional and structural unit. Domains are normally longer than motifs: a domain ranges from about 40 to 700 residues, with an average length of 100 residues. A domain may or may not include motifs within its boundaries. Examples include transmembrane domains and ligand-binding domains.

Identification of motifs and domains relies heavily on multiple sequence alignment, as well as on the construction of profiles and hidden Markov models (HMMs).

PROSITE (protein family and domain database):

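PROSITE was the first sequence pattern database to be established. www.expasy.org/prosite/ It is a database of protein families and domains that contains biologically significant sites, patterns, and statistical profiles that help identify the protein family a sequence belongs to. The sequence patterns in PROSITE include enzyme catalytic sites, ligand-binding sites, metal-ion-binding residues, cysteines involved in disulfide bonds, and regions that bind small molecules or other proteins. PROSITE also includes statistical profiles built from multiple sequence alignments, which detect more sensitively whether an (unknown) sequence carries the corresponding feature. The functional information for these patterns is based primarily on published literature.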

PRINTS (protein motif fingerprint database):

A fingerprint is a group of conserved motifs used to characterise a protein family; its diagnostic power is refined by iterative scanning of a SWISS-PROT/TrEMBL composite. Usually the motifs do not overlap but are separated along a sequence, though they may be contiguous in 3D space. http://bioinf.man.ac.uk/dbbrowser/PRINTS/ The site provides protein homology analysis, protein motif fingerprint analysis, phylogenetic and sequence evolution analysis, and microarray analysis, and offers downloads of bioinformatics data and PRINTS database data.

BLOCKS:

A database of blocks. Blocks are ungapped multiple alignments derived from the most conserved regions of homologous protein sequences.
The blocks, which are usually longer than motifs, are subsequently converted to position-specific scoring matrices (PSSMs). Because blocks often encompass motifs, the functional annotation of blocks is consistent with that of the motifs. http://blocks.fhcrc.org/blocks

For detecting and identifying protein motifs, the site offers the BLOCK search, Get Blocks, and Block Maker tools. A query sequence can be aligned against the precomputed profiles in the database to select the highest-scoring matches.
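As a rough illustration of how an ungapped block can be turned into a PSSM and then used to score a query, consider the sketch below. It assumes a uniform background frequency and a simple additive pseudocount, both chosen here for brevity; the BLOCKS server uses its own weighting scheme. The block, the query, and the function names `build_pssm` and `best_window` are all made up for this example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(block, pseudocount=1.0):
    """Return one log-odds dict per column of an ungapped block."""
    width = len(block[0])
    if any(len(seq) != width for seq in block):
        raise ValueError("all sequences in a block must have the same length")
    background = 1.0 / len(AMINO_ACIDS)   # assumption: uniform background
    total = len(block) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(width):
        counts = Counter(seq[col] for seq in block)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def best_window(pssm, query):
    """Slide the PSSM along the query; return (score, start) of the best window."""
    width = len(pssm)
    best = None
    for start in range(len(query) - width + 1):
        # Residues outside the 20-letter alphabet contribute a neutral score of 0.
        score = sum(pssm[i].get(query[start + i], 0.0) for i in range(width))
        if best is None or score > best[0]:
            best = (score, start)
    return best

# Hypothetical block of four aligned, ungapped segments.
block = ["GKSTL", "GKTTL", "GKSSL", "GKSTI"]
pssm = build_pssm(block)
print(best_window(pssm, "MAAGKSTLVV"))
```

Positive column scores mark residues that are enriched in the block relative to background, so the best window is the stretch of the query that most resembles the conserved region.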

ProDom

A domain database. ProDom is a comprehensive set of protein domain families automatically generated from the SWISS-PROT and TrEMBL sequence databases. The domains are built using recursive iterations of PSI-BLAST. http://prodom.prabi.fr/prodom/current/html/home.php It provides similarity searches and multiple sequence alignments of related domains from SWISS-PROT.

Pfam(Protein families database of alignments and HMMs)

A database of protein domains derived from sequences in SWISS-PROT and TrEMBL. Each motif or domain is represented by an HMM profile generated from the seed alignment of a number of conserved homologous proteins. http://pfam.janelia.org/ The Pfam database is composed of two parts: Pfam-A, which is based on manual alignments, and Pfam-B, which is built by automatic alignment in a way similar to ProDom (PSI-BLAST). The functional annotation of motifs in Pfam-A is often related to that in PROSITE. Pfam-B contains only sequence families not covered in Pfam-A. Because it is generated automatically, Pfam-B has much larger coverage but is also more error-prone, since some HMMs are generated from unrelated sequences.

SMART (Simple Modular Architecture Research Tool):

Contains HMM profiles constructed from manually refined protein domain alignments. http://smart.embl-heidelberg.de/ Alignments in the database are built based on tertiary structures whenever available or based on PSI-BLAST profiles. Alignments are further checked and refined by human annotators before HMM profile construction.

Protein functions are also manually curated. The database may be of better quality than Pfam with more extensive functional annotations.

Compared to Pfam, the SMART database contains an independent collection of HMMs, with emphasis on signaling, extracellular, and chromatin-associated motifs and domains.

Sequence searching in this database produces a graphical output of domains, with well-annotated information on cellular localization, functional sites, superfamily, and tertiary structure.

InterPro:

An integrated pattern database. www.ebi.ac.uk/interpro/ The database integrates information from the PROSITE, Pfam, PRINTS, ProDom, and SMART databases. The sequence patterns from the five databases are further processed: only overlapping motifs and domains in a protein sequence that are derived by all five databases are included. A popular feature of this database is a graphical output that summarizes motif matches and links to more detailed information.
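Read literally, the inclusion rule described above amounts to intersecting the hit intervals reported by the five member databases. The sketch below shows only that literal reading and is not a description of InterPro's actual integration pipeline. The function name `consensus_region`, the one-hit-per-database simplification, and the coordinates are hypothetical.

```python
MEMBER_DATABASES = ("PROSITE", "Pfam", "PRINTS", "ProDom", "SMART")

def consensus_region(hits):
    """Return the (start, end) region shared by all five databases, or None.

    `hits` maps a database name to one (start, end) hit on the same protein,
    using inclusive residue coordinates.
    """
    if any(db not in hits for db in MEMBER_DATABASES):
        return None                       # at least one database reports no hit
    start = max(hits[db][0] for db in MEMBER_DATABASES)
    end = min(hits[db][1] for db in MEMBER_DATABASES)
    return (start, end) if start <= end else None

# Made-up hit coordinates for a single protein.
example_hits = {
    "PROSITE": (40, 60),
    "Pfam": (20, 120),
    "PRINTS": (35, 80),
    "ProDom": (25, 110),
    "SMART": (30, 115),
}
print(consensus_region(example_hits))
```

The short PROSITE hit determines the shared region here, which matches the intuition that a motif usually sits inside the longer domain hits.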

CDD (Conserved Domain Database)

A collection of multiple sequence alignments for ancient domains and full-length proteins. http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml The CD-Search service may be used to identify the conserved domains present in a protein query sequence: http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi

RPS-BLAST (Reverse PSI-BLAST) is the search tool used in the CD-Search service. It uses a query sequence to search against a precomputed profile database generated by PSI-BLAST. The role of the PSSM has changed from “query” to “subject”, hence the term “reverse” in RPS-BLAST. It performs only one iteration of regular BLAST searching against a database of PSI-BLAST profiles to find the high-scoring gapped matches.
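The role reversal can be sketched in a few lines: the sequence is the query, and a collection of precomputed profiles is the subject. This is a toy model, not RPS-BLAST itself. It scores ungapped windows only, whereas RPS-BLAST reports gapped matches, and `toy_profile` is a hypothetical stand-in for a real PSI-BLAST PSSM. The profile names and sequences are invented.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def toy_profile(consensus, match=2, mismatch=-1):
    """Hypothetical stand-in for a PSSM: one score dict per consensus column."""
    return [{aa: (match if aa == c else mismatch) for aa in ALPHABET}
            for c in consensus]

def scan(profile, sequence):
    """Best ungapped score of a profile against any window of the sequence."""
    width = len(profile)
    window_scores = (
        sum(column.get(sequence[start + i], -1) for i, column in enumerate(profile))
        for start in range(len(sequence) - width + 1)
    )
    return max(window_scores, default=float("-inf"))

def rps_style_search(query, profile_db):
    """Score one query sequence against every profile; best hit first."""
    hits = [(scan(profile, query), name) for name, profile in profile_db.items()]
    return sorted(hits, reverse=True)

# Made-up profile database with two entries.
profile_db = {
    "domain_A": toy_profile("GKSTL"),
    "domain_B": toy_profile("CPHCH"),
}
print(rps_style_search("MAAGKSTLVV", profile_db))
```

Because the profiles are built once and reused, a search of this kind needs only a single pass over the profile database per query.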

CDART (Conserved Domain Architecture Retrieval Tool):

A domain search program. www.ncbi.nlm.nih.gov/BLAST/

Combines the results from RPS-BLAST, SMART, and Pfam. The resulting domain architecture of a query sequence can be graphically presented along with related sequences. CDART is not a substitute for individual database searches because it often misses certain features that can be found in SMART and Pfam.

II. Protein Family Databases

COG (Clusters of Orthologous Groups):

A protein family database based on phylogenetic classification. www.ncbi.nlm.nih.gov/COG/ It is constructed by comparing protein sequences encoded in completely sequenced genomes. Unicellular clusters are searched with the COGnitor program; eukaryotic clusters are searched with KOGnitor. A query sequence can be assigned a function if it has significant similarity matches with any member of a cluster.

ProtoNet:

A database of clusters of homologous proteins similar to COG. www.protonet.cs.huji.ac.il/ Orthologous protein sequences in the SWISSPROT database are clustered based on pairwise sequence comparisons between all possible protein pairs using BLAST. Protein relatedness is defined by the E-values from the BLAST alignments. A query protein sequence can be submitted to the server for cluster identification and functional annotation.
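To illustrate how E-values from all-against-all comparisons can define clusters, the sketch below performs a simple single-linkage grouping: two proteins end up in the same cluster whenever a chain of pairwise hits at or below a threshold connects them. This is a deliberate simplification and not ProtoNet's actual clustering procedure. The threshold, the protein identifiers, and the E-values are invented.

```python
def cluster_by_evalue(pairs, threshold=1e-5):
    """Group proteins by single linkage over pairwise E-values.

    `pairs` is an iterable of (protein_a, protein_b, e_value) tuples.
    Returns a sorted list of sorted clusters.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b, e_value in pairs:
        root_a, root_b = find(a), find(b)   # registers both proteins
        if e_value <= threshold and root_a != root_b:
            parent[root_b] = root_a         # merge the two clusters

    clusters = {}
    for protein in list(parent):
        clusters.setdefault(find(protein), set()).add(protein)
    return sorted(sorted(members) for members in clusters.values())

# Made-up pairwise BLAST results.
blast_pairs = [
    ("P1", "P2", 1e-30),
    ("P2", "P3", 1e-12),
    ("P3", "P4", 0.5),      # not significant, so no link is made
    ("P4", "P5", 1e-8),
]
print(cluster_by_evalue(blast_pairs))
```

Single linkage is sensitive to the chosen threshold: a looser cutoff can chain unrelated families together through one spurious hit.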

III. Protein Structure Databases

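PDB (Protein Data Bank)

PDB contains three-dimensional structures of biological macromolecules determined experimentally (by X-ray crystallography or nuclear magnetic resonance, NMR): proteins, nucleic acids, carbohydrates, and other complexes. http://www.rcsb.org/pdb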

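SCOP (Structural Classification of Proteins): protein structure classification database

SCOP provides a detailed description of the structural and evolutionary relationships between proteins of known structure, covering all entries in the PDB protein structure database. http://scop.mrc-lmb.cam.ac.uk/scop/ Besides structural and evolutionary relationships, SCOP also includes the following for each protein: links to PDB, the sequence, references, images of the structure, and so on. Proteins can be classified by structural and evolutionary relationship; the result is a hierarchical tree whose main levels are family, superfamily, and fold:
Family: a clear evolutionary relationship
Superfamily: a distant evolutionary relationship, with a common evolutionary origin
Fold: major structural similarity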

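DSSP (protein secondary structure database)

For any protein in the PDB macromolecular structure database, DSSP derives the corresponding secondary structure from its three-dimensional structure. http://www.sander.embl-heidelberg.de/dssp/ It is very useful for studying the relationship between protein sequence and protein secondary and three-dimensional structure. In addition to secondary structure, DSSP also includes geometric features of the protein and solvent accessibility.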

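HSSP (protein homologous sequence alignment database)

A secondary database. http://www.sander.embl-heidelberg.de/hssp/ Its data come from PDB or from SWISS-PROT. For each protein in PDB, HSSP aligns all protein sequences homologous to it, thereby grouping proteins with similar sequences into structurally homologous families. HSSP helps in analyzing conserved regions of proteins, studying protein evolutionary relationships, and designing protein molecules.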

IV. Other Biological Macromolecule Databases

MMDB (Molecular Modeling Database)

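MMDB is part of NCBI Entrez and contains experimentally determined structural data for biological macromolecules. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure Compared with PDB, MMDB carries additional information for each macromolecular structure in the database, such as the molecule's biological function, the mechanism underlying that function, and the molecule's evolutionary history. It also provides tools for displaying three-dimensional structural models of macromolecules, for structure analysis, and for structure comparison.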

dbSNP (Single Nucleotide Polymorphism database)

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=snp

OMIM (Online Mendelian Inheritance in Man)

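A catalog of human genes and genetic disorders. The database collects known human genes and the genetic diseases caused by mutation or deletion of those genes. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM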

EPD

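Eukaryotic Promoter Database. http://www.epd.isb-sib.ch/ It provides promoter sequences of eukaryotic genes obtained from EMBL, with the aim of helping experimental researchers and bioinformatics researchers analyze the transcription signals of eukaryotic genes.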

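TRRD (Transcription Regulatory Regions Database)

An integrated database of gene regulation information. The database collects information on the structure and function of the transcription regulatory regions of eukaryotic genes.

Each TRRD entry corresponds to one gene and contains the various structure-function features of that gene. http://wwwmgs.bionet.nsc.ru/mgs/gnw/trrd/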

References:

Lecture slides by Li Yuqiang

http://wenku.baidu.com/link?url=942lI1TEoY6tmC4T2RHZIGqqCWPqHoYMN3cOVUdKN9a7z7ce2FoYwYllHPwieBv51foAEu1qjaFmYDLES9CzFbV1Pg4V4wyW8bDJzknNHty
