pysam是python中用来读写SAM/BAM文件的一个模块。底层是用C写的

一、简介

pysam官网

Pysam is a wrapper of the htslib C-API and provides facilities to read and write SAM/BAM/VCF/BCF/BED/GFF/GTF/FASTA/FASTQ files as well as access to the command line functionality of the samtools and bcftools packages. The module supports compression and random access through indexing.

安装

pip install pysam

二. 函数详解

1.读写bam

import pysam
samfile = pysam.AlignmentFile(input_bam, "rb")

#然后可以通过fetch方法来调取比对到某个区间对应的reads信息

for read in samfile.fetch('chr1', 100, 120):
   print read
   break
samfile.close()

得到结果:

EAS56_57:6:190:289:82       0       99      <<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<;     69      CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA     0       192     1
EAS56_57:6:190:289:82       0       99      <<<<<<;<<<<<<<<<<;<<;<<<<;8<6;9;;2;     137     AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC     73      64      1
EAS51_64:3:190:727:308      0       102     <<<<<<<<<<<<<<<<<<<<<<<<<<<::<<<844     99      GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG     99      18      1
...

#取的位置是交集还是并集呢?
初步理解为跟这个区间有交集的序列都提出来

也可以写bam文件

import pysam
samfile = pysam.AlignmentFile("ex1.bam", "rb")
pairedreads = pysam.AlignmentFile("allpaired.bam", "wb", template=samfile)
for read in samfile.fetch():
     if read.is_paired:
             pairedreads.write(read)

pairedreads.close()
samfile.close()

注:

1.AlignmentFile(filepath_or_object, mode=None, template=None, reference_names=None, reference_lengths=None, text=NULL, header=None, add_sq_text=False, check_header=True, check_sq=True, reference_filename=None, filename=None, duplicate_filehandle=True)

2.AlignmentFile打开得bam文件需要是已经建立索引,有.bai文件存在的bam文件

3.template将通过AlignmentFile打开得bam文件的header拷贝过来

 

2.提取某个碱基

import pysam
samfile = pysam.AlignmentFile("ex1.bam", "rb" )
for pileupcolumn in samfile.pileup("chr1", 100, 120,truncate=True):
    print ("\ncoverage at base %s = %s" %
           (pileupcolumn.pos, pileupcolumn.n))
    for pileupread in pileupcolumn.pileups:
        if not pileupread.is_del and not pileupread.is_refskip:
            # query position is None if is_del or is_refskip is set.
            print ('\tbase in read %s = %s' %
                  (pileupread.alignment.query_name,
                   pileupread.alignment.query_sequence[pileupread.query_position]))

samfile.close()

输出结果:

coverage at base 99 = 1
    base in read EAS56_57:6:190:289:82 = A

coverage at base 100 = 1
    base in read EAS56_57:6:190:289:82 = G

coverage at base 101 = 1
    base in read EAS56_57:6:190:289:82 = G

coverage at base 102 = 2
    base in read EAS56_57:6:190:289:82 = G
    base in read EAS51_64:3:190:727:308 = G
...

实现逻辑:
1.根据给得区间”chr1″, 100, 120,一次,从100到120点所对应的所有reads,通过pileup提取的reads有这几个属性:’n’,’nsegments’, ‘pileups’, ‘pos’, ‘reference_id’, ‘reference_pos’, ‘tid’

a.起点,终点?这里应该是先找出跟区间至少有一个碱基交集的reads,然后这些reads的所覆盖的区间的起点为pileup的起点

b.如果加上参数truncate=True,那起点就成了100,119,注意这里起始的变化,这里仅仅是输出做了一个限制,但是提取的reads的方式还是没有变化,就是只要跟指定的区间有一个碱基交集的reads就给提出来了

c.而pos代表该位点的坐标,n代表它的coverage,pileups代表对应的reads

2.每个点对应的每条read可以通过pileups来提,而这些reads有这些属性’alignment’, ‘indel’, ‘is_del’, ‘is_head’, ‘is_refskip’, ‘is_tail’, ‘level’, ‘query_position

3.排序

pysam.sort("ex1.bam", "output")

跟samtools的效果一样

samtools sort ex1.bam output

4.打开其他文件(这个功能暂时没有用到)

import pysam
tabixfile = pysam.TabixFile("example.gtf.gz")
for gtf in tabixfile.fetch("chr1", 1000, 2000):
    print (gtf.contig, gtf.start, gtf.end, gtf.gene_id)

5.其他函数与功能略

import pysam

def mutate(args):
    assert mutend > mutstart, "mutation start must occur before mutation end: " + mutid

    bamfile = pysam.Samfile(args.bamFileName, 'rb')  #读取bam文件

    for pcol in bamfile.pileup(reference=chrom, start=mutstart, end=mutend, max_depth=int(args.maxdepth), truncate=True):
#提取包含chrom:mutstart-mutend这个区段的所有reads。pcol.pos 提取碱基的位置;包含mutstart,不包含mutend;truncate=True,指定到我们指定的碱基位置
        if pcol.pos:
            num_good_reads=0
            num_better_reads=0
            refbase = reffile.fetch(chrom, pcol.pos-1, pcol.pos)  
            #提取pcol.pos的碱基
            for pread in pcol.pileups:  #提取一条序列pread
                if avoid is not None and pread.alignment.qname in avoid:
                    print "WARN\t" + now() + "\t" + region + "\tdropped mutation due to read in --avoidlist", pread.alignment.qname
     
                if pread.query_position is not None and not pread.alignment.is_secondary and bin(pread.alignment.flag & 2048) != bin(2048):
  # is_secondary==true if not primary alignment
  #query_position==position of the read base at the pileup site, 0-based. None if is_del or is_refskip is set

                    if pread.alignment.is_read2:
                        pairname = 'S' # read is second in pair
                    if not pread.alignment.is_paired:
                        pairname = 'U' # read is unpaired
                    outreads_qual[extqname]=pread.alignment.qual #质量数据

    bamfile.close()

三、讨论:

  1. ValueError: quality and sequence mismatch: 148 != 151
    原因: pysam读取后的seq和qual长度必须一样啊!!

2.文件的写入

注:
先整理到这里,后面用的时候,再回来梳理

参考资料:
http://pysam.readthedocs.org/en/latest/glossary.html#term-column
http://pysam.readthedocs.org/en/latest/api.html#pysam.AlignmentFile.pileup

发表评论

电子邮件地址不会被公开。 必填项已用*标注