【2.3】biopython pairwise2实现两两序列比对

September 06, 2019 mapping 阅读量：次

一、pairwise2.align介绍

Pairwise sequence alignment使用动态规划算法。

这提供了在两个序列之间获得全局和局部比对的函数。全局对齐在两个序列中的所有字符之间找到最佳一致性。局部对齐只找到最佳对齐的子序列。局部比对必须具有正分数才能报告，并且不会针对“zero counting”匹配进行扩展。这意味着局部比对将始终以正计数匹配开始和结束。

在进行比对时，您可以指定匹配分数和间隙罚分。匹配分数表示序列中两个字符的对齐之间的兼容性。高度兼容的角色应该给出正分数，不相容的角色应该给出负分或0分。gap惩罚应该是负数。

此模块中的对齐函数的名称遵循约定

global + XX
local + XX

XX是2个字符的代码，表示它所采用的参数。

第一个字符表示匹配（和不匹配）的参数，第二个字符表示间隙罚分的参数。

匹配参数是：

x 没有参数。相同的字符得分为1，否则为0。
m 匹配分数是相同字符的分数，否则不匹配得分。
d 字典返回任意一对字符的分数。
c 回调函数返回分数。

gap 惩罚参数是：

x 无gap罚款。
s 同样的开放和延长两个序列的gap罚分。
d 序列具有不同的开放和延长间隙罚分。
c 回调函数返回间隙罚分。

所有不同的对齐函数都包含在对象align中。例如：

from Bio import pairwise2
alignments = pairwise2.align.globalxx("ACCGT", "ACG")

将返回两个字符串之间的对齐列表。要获得良好的打印输出，请使用模块的format_alignment方法：

>>> from Bio.pairwise2 import format_alignment
>>> print(format_alignment(*alignments[0]))
ACCGT
| ||
A-CG-
  Score=3
<BLANKLINE>

align函数中包含的参数：

Two sequences: strings, Biopython sequence objects or lists. Lists are useful for supplying sequences which contain residues that are encoded by more than one letter.
penalize_extend_when_opening: boolean (default: False). Whether to count an extension penalty when opening a gap. If false, a gap of 1 is only penalized an “open” penalty, otherwise it is penalized “open+extend”.
penalize_end_gaps: boolean. Whether to count the gaps at the ends of an alignment. By default, they are counted for global alignments but not for local ones. Setting penalize_end_gaps to (boolean, boolean) allows you to specify for the two sequences separately whether gaps at the end of the alignment should be counted.
gap_char: string (default: ‘-'). Which character to use as a gap character in the alignment returned. If your input sequences are lists, you must change this to ['-'].
force_generic: boolean (default: False). Always use the generic, non-cached, dynamic programming function (slow!). For debugging.
score_only: boolean (default: False). Only get the best score, don’t recover any alignments. The return value of the function is the score. Faster and uses less memory.
one_alignment_only: boolean (default: False). Only recover one alignment.

二、其他例子

齐函数的其他参数取决于调用的函数。一些例子：

2.1 找到两个序列之间的最佳全局对齐。相同的字符给出1分。对于不匹配或差距，不会扣除任何积分。

>>> for a in pairwise2.align.globalxx("ACCGT", "ACG"):
...     print(format_alignment(*a))
ACCGT
| ||
A-CG-
  Score=3
<BLANKLINE>
ACCGT
|| |
AC-G-
  Score=3
<BLANKLINE>

2.2 和以前一样，但是局部对齐。请注意，format_alignment仅显示序列的对齐部分以及起始位置。

>>> for a in pairwise2.align.localxx("ACCGT", "ACG"):
...     print(format_alignment(*a))
1 ACCG
  | ||
1 A-CG
  Score=3
<BLANKLINE>
1 ACCG
  || |
1 AC-G
  Score=3
<BLANKLINE>

2.3 做全局比对。相同的字符给出2个点，每个不相同的字符扣除1个点。不要惩罚gaps。

>>> for a in pairwise2.align.globalmx("ACCGT", "ACG", 2, -1):
...     print(format_alignment(*a))
ACCGT
| ||
A-CG-
  Score=6
<BLANKLINE>
ACCGT
|| |
AC-G-
  Score=6
<BLANKLINE>

2.4 与上述相同，除了现在打开间隙时扣除0.5分，并且在延长时扣除0.1分。

>>> for a in pairwise2.align.globalms("ACCGT", "ACG", 2, -1, -.5, -.1):
...     print(format_alignment(*a))
ACCGT
| ||
A-CG-
  Score=5
<BLANKLINE>
ACCGT
|| |
AC-G-
  Score=5
<BLANKLINE>

2.5 根据惩罚，一个序列中的间隙可能跟随另一个序列中的间隙。如果您不喜欢这种行为，请增加缺口开始的罚分

>>> for a in pairwise2.align.globalms("A", "T", 5, -4, -1, -.1):
...     print(format_alignment(*a))
A-
<BLANKLINE>
-T
  Score=-2
<BLANKLINE>
>>> for a in pairwise2.align.globalms("A", "T", 5, -4, -3, -.1):
...     print(format_alignment(*a))
A
.
T
  Score=-4
<BLANKLINE>

2.6 对齐函数还可以使用已包含在Biopython中的已知矩阵（来自Bio.SubsMat的MatrixInfo）：

>>> from Bio.SubsMat import MatrixInfo as matlist
>>> matrix = matlist.blosum62
>>> for a in pairwise2.align.globaldx("KEVLA", "EVL", matrix):
...     print(format_alignment(*a))
KEVLA
 |||
-EVL-
  Score=13
<BLANKLINE>

注：自带的打分矩阵包括：benner6, benner22, benner74, blosum100, blosum30, blosum35, blosum40, blosum45, blosum50, blosum55, blosum60, blosum62, blosum65, blosum70, blosum75, blosum80, blosum85, blosum90, blosum95, feng, fitch, genetic, gonnet, grant, ident, johnson, levin, mclach, miyata, nwsgappep, pam120, pam180, pam250, pam30, pam300, pam60, pam90, rao, risler, structure

2.7 使用参数c，您可以定义自己的匹配和间隙函数。例如。定义仿射对数间隙函数并使用它：

>>> from math import log
>>> def gap_function(x, y):  # x is gap position in seq, y is gap length
...     if y == 0:  # No gap
...         return 0
...     elif y == 1:  # Gap open penalty
...         return -2
...     return - (2 + y/4.0 + log(y)/2.0)
...
>>> alignment = pairwise2.align.globalmc("ACCCCCGT", "ACG", 5, -4,
...                                      gap_function, gap_function)

您可以为每个序列定义不同的间隙函数。自定义匹配函数必须将两个残基进行比较并返回分数。

三、我的案例

获得两两序列的相似比（这里用的是全局比对）

def get_similarity_pct(query, target):
    alignments = pairwise2.align.globalxx(query, target) # 全局比对，相同的残基就给1分，不同和gap不扣分
    seq_length = min(len(query), len(target))
    matches = alignments[0][2]
    percent_match = (matches / seq_length) * 100
    return percent_match

序列比对

def align_aa(query, target):
    alignments = pairwise2.align.globalds(query, target, MatrixInfo.blosum62, -10, -2)
    return alignments[0]


def print_aln(aln):
    """
    DIQMTQSPSSLSASVGDRVTITCRASENIYSYLAWYQQKPGKSPKLLIYNARTLAEGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCQYHSGSPLPFGGGTKVEIKR
||||||||||||||||||||||||||||||||||||||||||.|||||||||||||||||||||||||||||||||||||||.||||||||||||||
DIQMTQSPSSLSASVGDRVTITCRASENIYSYLAWYQQKPGKVPKLLIYNARTLAEGVPSRFSGSGSGTDFTLTISSLQPEDVATYYCQYHSGSPLP-----------
  Score=465
    """
    print(format_alignment(*aln))

print_aln(align_aa(query, target))


print(align_aa(’YVIYSLH‘, ‘LAISDQTKHA’))

输出结果

('YVI--YSLH-', 'LAISDQTKHA', -13.0, 0, 10)
# 第一个是query序列，第二个是target序列，第三列是得分，后面两个是比对的起始和终止位置

参考资料

药企，独角兽，苏州。团队长期招人，感兴趣的都可以发邮件聊聊：tiehan@sina.cn

个人公众号，比较懒，很少更新，可以在上面提问题，如果回复不及时，可发邮件给我： tiehan@sina.cn